What is the role of using a OneVsRestClassifier wrapper around XGBClassifier? - python-3.x

I have a multiclass classification problem with 3 classes.
0 - on a given day (24h) my laptop battery did not die
1 - on a given day my laptop battery died before 12PM (noon)
2 - on a given day my laptop battery died at or after 12PM (noon)
(Note that these categories are mutually exclusive. The battery is not recharged once it died)
I am interested in the predicted probability for each of the 3 classes. More specifically, I intend to derive 2 types of warning:
If the prediction for class 1 is higher than a threshold x: 'Your battery is at risk of dying in the morning.'
If the prediction for class 2 is higher than a threshold y: 'Your battery is at risk of dying in the afternoon.'
I can generate the probabilities by using xgboost.XGBClassifier with the appropriate parameters for a multiclass problem.
import numpy as np
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from xgboost import XGBClassifier
X = np.array([
[10, 10],
[8, 10],
[-5, 5.5],
[-5.4, 5.5],
[-20, -20],
[-15, -20]
])
y = np.array([0, 1, 1, 1, 2, 2])
clf1 = XGBClassifier(objective = 'multi:softprob', num_class = 3, seed = 42)
clf1.fit(X, y)
clf1.predict_proba([[-19, -20]])
Results:
array([[0.15134096, 0.3304505 , 0.51820856]], dtype=float32)
But I can also wrap this with sklearn.multiclass.OneVsRestClassifier, which then produces slightly different results:
clf2 = OneVsRestClassifier(XGBClassifier(objective = 'multi:softprob', num_class = 3, seed = 42))
clf2.fit(X, y)
clf2.predict_proba([[-19, -20]])
Results:
array([[0.10356173, 0.34510303, 0.5513352 ]], dtype=float32)
I was expecting the two approaches to produce the same results. My understanding was that XGBClassifier is also based on a one-vs-rest approach in the multiclass case, since there are 3 probabilities in the output and they sum to 1.
Can you tell me where the difference comes from, and how the respective results should be interpreted? And most importantly, which approach is better suited to my problem?
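One possible source of the difference, sketched here with made-up scores rather than real xgboost output: a softmax objective such as multi:softprob normalizes the per-class scores of a single jointly trained model, whereas scikit-learn's OneVsRestClassifier fits one binary model per class and, for a multiclass target, rescales the positive-class probabilities so that they sum to 1. Both outputs sum to 1, so that property alone does not show that the two procedures are the same. The raw_scores values and the helper names are hypothetical, and in practice the separately trained binary models would also produce different scores from the joint model.

```python
import math

def softmax(scores):
    """Joint normalization, as used by a softmax-style multiclass objective."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))

def ovr_normalize(binary_scores):
    """One-vs-rest: independent binary probabilities, rescaled to sum to 1."""
    probs = [sigmoid(s) for s in binary_scores]
    total = sum(probs)
    return [p / total for p in probs]

# Hypothetical per-class scores for one sample (not taken from xgboost)
raw_scores = [-0.5, 0.3, 0.8]

softmax_probs = softmax(raw_scores)
ovr_probs = ovr_normalize(raw_scores)

print(softmax_probs)
print(ovr_probs)
```

Even with identical scores, the two normalizations give different probabilities, so thresholds x and y tuned for one approach would not necessarily carry over to the other.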

Related

Multiclass classification per class recall equals per class accuracy?

I've got a multiclass problem. I'm using sklearn.metrics to calculate the confusion matrix, overall accuracy, per class precision, per class recall and per class F1-score.
Now I wanted to calculate the per class accuracy. Since there is no method in sklearn for this, I used another one which I got from a Google search. I've now realised that the per class recall equals the per class accuracy. Can anyone explain to me if this holds true and, if yes, why?
I found an explanation here, but I'm not sure since there the micro-recall equals the overall accuracy if I'm understanding it correctly. And I'm looking for the per class accuracy.
I experienced the same results, because per-class recall = TP / (TP + FN). Here TP + FN is the number of samples in that class, so the formula resembles an accuracy computed over that class only.
This generally doesn't hold. Accuracy and recall are calculated using different formulas and measure different things.
Recall is the percentage of true positive data points compared to all data points that actually are positive (true positives plus false negatives).
Accuracy is the percentage of all examples that are classified correctly, including positive and negative.
If they are equal, this is either coincidence or you have an error in your method of calculating them. Most likely it is coincidence.
EDIT:
I will show why it's not the case with an example that can be generalised to N classes.
Let's assume three classes: 0, 1, 2 with the following confusion matrix:
[[3 0 1]
[2 5 0]
[0 1 4]]
When we want to calculate measures per class, we treat the problem as binary. For example, for class 0 we combine 1 and 2 into 'not 0'. This results in the following confusion matrix:
[[3 1]
[2 10]]
Resulting in:
TP = 3
FP = 2
FN = 1
TN = 10
Accuracy = (TN + TP) / (N + P)
Recall = TP / (TP + FN)
So you can already tell from these formulas that they are really not equal. To disprove a hypothesis in mathematics it suffices to show a counterexample, in this case an example that shows that accuracy is not equal to recall.
Filling in this example we get:
Accuracy = 13/16
Recall = 3/4
And 13/16 is not equal to 3/4, thus disproving the hypothesis that per class accuracy is equal to per class recall.
It is, however, also possible to provide examples for which the two are equal. But because the equality does not hold in general, the hypothesis is disproven.
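The one-vs-rest bookkeeping in this answer can be checked mechanically. The following is a minimal sketch that derives TP, FP, FN and TN for each class directly from the 3x3 matrix above, assuming rows are true labels and columns are predicted labels; the function name is illustrative.

```python
# Confusion matrix from the example: rows = true class, columns = predicted class
cm = [[3, 0, 1],
      [2, 5, 0],
      [0, 1, 4]]

def one_vs_rest_counts(cm, k):
    """Collapse the matrix into class k versus 'not k' and return TP, FP, FN, TN."""
    total = sum(sum(row) for row in cm)
    tp = cm[k][k]
    fn = sum(cm[k]) - tp                 # true class k, predicted as something else
    fp = sum(row[k] for row in cm) - tp  # other classes, predicted as k
    tn = total - tp - fn - fp
    return tp, fp, fn, tn

for k in range(len(cm)):
    tp, fp, fn, tn = one_vs_rest_counts(cm, k)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    recall = tp / (tp + fn)
    print(f"class {k}: accuracy={accuracy:.4f} recall={recall:.4f}")
```

Under this binary definition the accuracy includes the true negatives, which is why it generally differs from recall.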
Not sure if you are looking for average per-class accuracy as a single metric or per-class accuracy as separate metrics for each class.
For per-class accuracy as a separate metric for each class, see the code below. It's the same as recall-micro per class.
For average per-class accuracy as a single metric, it is equivalent to recall-macro (which is equivalent to balanced accuracy in sklearn). See the code below.
Here is the empirical demonstration in code.
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score
label_class1 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
label_class2 = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
labels = label_class1 + label_class2
pred_class1 = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred_class2 = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
pred = pred_class1 + pred_class2
# 1. calculate accuracy scores per class
score_accuracy_class1 = accuracy_score(label_class1, pred_class1)
score_accuracy_class2 = accuracy_score(label_class2, pred_class2)
print(score_accuracy_class1) # 0.6
print(score_accuracy_class2) # 0.9
# 2. calculate recall scores per class
score_recall_class1 = recall_score(label_class1, pred_class1, average='micro')
score_recall_class2 = recall_score(label_class2, pred_class2, average='micro')
print(score_recall_class1) # 0.6
print(score_recall_class2) # 0.9
assert score_accuracy_class1 == score_recall_class1
assert score_accuracy_class2 == score_recall_class2
# 3. this also means that average per-class accuracy is equivalent to averaged recall and balanced accuracy
score_balanced_accuracy1 = (score_accuracy_class1 + score_accuracy_class2) / 2
score_balanced_accuracy2 = (score_recall_class1 + score_recall_class2) / 2
score_balanced_accuracy3 = balanced_accuracy_score(labels, pred)
score_balanced_accuracy4 = recall_score(labels, pred, average='macro')
print(score_balanced_accuracy1) # 0.75
print(score_balanced_accuracy2) # 0.75
print(score_balanced_accuracy3) # 0.75
print(score_balanced_accuracy4) # 0.75
# balanced accuracy, average per-class accuracy and recall-macro are equivalent
assert score_balanced_accuracy1 == score_balanced_accuracy2 == score_balanced_accuracy3 == score_balanced_accuracy4
These official docs say: "balanced accuracy ... is defined as the average of recall obtained on each class."
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.balanced_accuracy_score.html

Optimization of predictions from sklearn model (e.g. RandomForestRegressor)

Has anyone used optimization models on fitted sklearn models?
What I'd like to do is fit a model on training data and then use that model to find the combination of input values for which the model predicts the largest value.
Some example, simplified code:
import pandas as pd
df = pd.DataFrame({
'temperature': [10, 15, 30, 20, 25, 30],
'working_hours': [10, 12, 12, 10, 30, 15],
'sales': [4, 7, 6, 7.3, 10, 8]
})
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
X = df.drop(['sales'], axis=1)
y = df['sales']
model.fit(X, y);
Our baseline is a simple loop that predicts every combination of the variables:
results = pd.DataFrame(columns=['temperature', 'working_hours', 'sales_predicted'])
import numpy as np
for temp in np.arange(1,100.01,1):
    for work_hours in np.arange(1,60.01,1):
        results = pd.concat([
            results,
            pd.DataFrame({
                'temperature': temp,
                'working_hours': work_hours,
                'sales_predicted': model.predict(np.array([temp, work_hours]).reshape(1,-1))
            })
        ])
print(results.sort_values(by='sales_predicted', ascending=False))
Done that way, it's difficult or impossible to:
* do it fast (it is a brute-force method)
* implement constraints that involve dependencies between two or more variables
We tried the PuLP and Pyomo libraries, but neither allows using model.predict as an objective function, which fails with the error:
TypeError: float() argument must be a string or a number, not 'LpVariable'
Does anyone have any idea how we can get rid of the loop and use something else?
When people talk about optimizing fitted sklearn models, they usually mean maximizing accuracy/performance metrics. If you are instead trying to maximize the predicted value, you can improve your code to do it more efficiently, as shown below.
You are collecting all the predictions in a big results dataframe and then sorting it in descending order. Instead, you can track the largest value of your target variable (sales_predicted) on the fly, using simple if logic. So just change your loop into this:
max_sales_predicted = 0
for temp in np.arange(1, 100.01, 1):
    for work_hours in np.arange(1, 60.01, 1):
        # model.predict returns a length-1 array; take its single element
        sales_predicted = model.predict(np.array([temp, work_hours]).reshape(1, -1))[0]
        if sales_predicted > max_sales_predicted:
            max_sales_predicted = sales_predicted
            desired_temp = temp
            desired_work_hours = work_hours
That way you only record a combination when it produces a prediction that exceeds your current maximum, and otherwise do nothing.
The result of my code is the same as yours, i.e. a max_sales_predicted value of 9.2. Also, desired_temp and desired_work_hours now give you the combination that produces that maximum. Hope this helps.
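The loop above still calls the model once per grid point and does not address constraints between variables. One possible variation, sketched under stated assumptions, is to build the whole grid first, drop the combinations that violate a constraint, and score the remaining rows in a single batched call. Here predict_batch is a hypothetical stand-in for model.predict (which accepts a 2D array of rows), and the constraint temperature + working_hours <= 100 is invented purely for illustration.

```python
from itertools import product

def predict_batch(rows):
    """Hypothetical stand-in for model.predict(rows): one prediction per (temp, hours) row."""
    return [0.1 * temp + 0.2 * hours - 0.002 * temp * temp for temp, hours in rows]

temps = range(1, 101)
hours = range(1, 61)

# Keep only the combinations that satisfy the (made-up) joint constraint
grid = [(t, h) for t, h in product(temps, hours) if t + h <= 100]

# Score every remaining combination in one call instead of one call per row
preds = predict_batch(grid)

# Index of the largest prediction
best_index = max(range(len(grid)), key=preds.__getitem__)
best_temp, best_hours = grid[best_index]
best_pred = preds[best_index]

print(best_temp, best_hours, best_pred)
```

With a real fitted estimator, passing the filtered grid as one array to model.predict usually removes most of the per-call overhead of the nested loop, although it remains a brute-force search.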

how does sklearn compute the Accuracy score step by step?

I was reading about the metrics used in sklearn, but I find the following pretty confusing:
In the documentation, sklearn provides an example of its usage as follows:
import numpy as np
from sklearn.metrics import accuracy_score
y_pred = [0, 2, 1, 3]
y_true = [0, 1, 2, 3]
accuracy_score(y_true, y_pred)
0.5
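I understand that sklearn computes the metric as follows:
\texttt{accuracy}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} 1(\hat{y}_i = y_i)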
I am not sure about the process. I would appreciate it if someone could explain this result step by step, since I have been studying it but find it hard to understand. To understand it better, I tried the following case:
import numpy as np
from sklearn.metrics import accuracy_score
y_pred = [0, 2, 1, 3,0]
y_true = [0, 1, 2, 3,0]
print(accuracy_score(y_true, y_pred))
0.6
And I supposed that the correct computation would be the following:
but I am not sure about it. I would like someone to walk me through the computation rather than copy and paste sklearn's documentation.
I am unsure whether the i in the summation is the same as the i in the formula inside the parentheses. It is unclear to me whether the number of terms in the summation depends only on the number of elements in the sample or also on the number of classes.
The indicator function equals one only if its arguments are equal; otherwise its value is zero. Therefore, when y is equal to yhat the indicator function produces a one, counting as a correct classification. There is a code example in Python and a numerical example below.
import numpy as np
yhat=np.array([0,2,1,3])
y=np.array([0,1,2,3])
acc=np.mean(y==yhat)
print( acc)
A simple way to understand the calculation of the accuracy is:
Given two lists, y_pred and y_true, for every position index i, compare the i-th element of y_pred with the i-th element of y_true and perform the following calculation:
Count the number of matches
Divide it by the number of samples
So using your own example:
y_pred = [0, 2, 1, 3, 0]
y_true = [0, 1, 2, 3, 0]
We see matches on indices 0, 3 and 4. Thus:
number of matches = 3
number of samples = 5
Finally, the accuracy calculation:
accuracy = matches/samples
accuracy = 3/5
accuracy = 0.6
And for your question about the i index, it is the sample index, so it is the same for both the summation index and the Y/Yhat index.
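The counting steps above can also be written as an explicit sum of indicator values, one term per sample. This is a minimal sketch using only the standard library, with illustrative variable names; note that the number of classes never enters the calculation.

```python
y_pred = [0, 2, 1, 3, 0]
y_true = [0, 1, 2, 3, 0]

# Indicator value for each sample index i: 1 if the prediction matches, else 0
indicators = [1 if pred == true else 0 for pred, true in zip(y_pred, y_true)]

# Sum of the indicators divided by the number of samples
accuracy = sum(indicators) / len(y_true)

print(indicators)
print(accuracy)
```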

How to plot accuracy bars for each feature of an array

I have a data set "x" and its label vector "y". I want to plot the accuracy for each attribute (for each column of "x") after applying NaiveBayes and cross-validation. I want a bar graph.
So at the end I need to have 3 bars, because "x" has 3 columns. And the classification has to run 3 times. 3 different accuracies for each feature.
Whenever I execute my code it shows:
ValueError: Found arrays with inconsistent numbers of samples: [1 3]
DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
What am I doing wrong?
import matplotlib.pyplot as plt
import numpy as np
from sklearn import cross_validation
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
x = np.array([[0, 0.51, 0.00101], [3, 0.54, 0.00105], [6, 0.57, 0.00108], [9, 0.60, 0.00111], [1, 0.73, 0.00114], [5, 0.76, 0.00117], [8, 0.89, 120]])
y = np.array([1, 0, 0, 1, 1, 1, 0])
scores = list()
scores_std = list()
for i in range(x.shape[1]):
    xA=x[:, i]
    scoresKF2 = cross_validation.cross_val_score(clf, xA, y, cv=2)
    scores.append(np.mean(scoresKF2))
    scores_std.append(np.std(scoresKF2))
plt.bar(x[:,i], scores)
plt.show()
Checking the shape of your input data, xA, shows that it is 1-dimensional: specifically, its shape is (7,). As the warning tells us, you are not allowed to pass in a 1d array here. The key to solving this is in the warning that was returned: "Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample." Therefore, since xA is a single feature, use xA = x[:,i].reshape(-1, 1) instead of xA = x[:,i].
I think there is another issue with the plotting. I'm not completely sure what you are expecting to see, but you should probably replace plt.bar(x[:,i], scores) with plt.bar(i, np.mean(scoresKF2)).
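To make the shape issue concrete, here is a small sketch that uses only NumPy and the first three rows of the question's data. It compares the 1-D slice with two ways of getting a single-feature 2-D column; the variable names are illustrative.

```python
import numpy as np

# First three rows of the question's data, enough to show the shapes
x = np.array([[0, 0.51, 0.00101],
              [3, 0.54, 0.00105],
              [6, 0.57, 0.00108]])

column_1d = x[:, 0]                 # plain slice: 1-D, shape (3,)
column_2d = x[:, 0].reshape(-1, 1)  # one feature, many samples: shape (3, 1)
column_2d_alt = x[:, [0]]           # indexing with a list keeps the array 2-D

print(column_1d.shape, column_2d.shape, column_2d_alt.shape)
```

Either 2-D form has the (n_samples, n_features) layout that scikit-learn estimators expect.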

Scikit Learn Logistic Regression confusion

I'm having some trouble understanding scikit-learn's LogisticRegression() method. Here's a simple example:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
# Create a sample dataframe
data = [['Age', 'ZepplinFan'], [13, 0], [25, 0], [40, 1], [51, 0], [55, 1], [58, 1]]
columns=data.pop(0)
df = pd.DataFrame(data=data, columns=columns)
Age ZepplinFan
0 13 0
1 25 0
2 40 1
3 51 0
4 55 1
5 58 1
# Fit Logistic Regression
lr = LogisticRegression()
lr.fit(X=df[['Age']], y = df['ZepplinFan'])
# View the coefficients
lr.intercept_ # returns -0.56333276
lr.coef_ # returns 0.02368826
# Predict for new values
xvals = np.arange(-10,70,1)
predictions = lr.predict_proba(X=xvals[:,np.newaxis])
probs = [y for [x, y] in predictions]
# Plot the fitted model
plt.plot(xvals, probs)
plt.scatter(df.Age.values, df.ZepplinFan.values)
plt.show()
Obviously this doesn't appear to be a good fit. Furthermore, when I do this exercise in R I get different coefficients and a model that makes more sense.
lapply(c("data.table","ggplot2"), require, character.only=T)
dt <- data.table(Age=c(13, 25, 40, 51, 55, 58), ZepplinFan=c(0, 0, 1, 0, 1, 1))
mylogit <- glm(ZepplinFan ~ Age, data = dt, family = "binomial")
newdata <- data.table(Age=seq(10,70,1))
newdata[, ZepplinFan:=predict(mylogit, newdata=newdata, type="response")]
mylogit$coeff
(Intercept) Age
-4.8434 0.1148
ggplot()+geom_point(data=dt, aes(x=Age, y=ZepplinFan))+geom_line(data=newdata, aes(x=Age, y=ZepplinFan))
What am I missing here?
The problem you are facing is related to the fact that scikit-learn uses regularized logistic regression. The regularization term controls the trade-off between fit to the data and generalization to future unseen data. The parameter C controls the regularization strength (smaller C means stronger regularization). In your case:
lr = LogisticRegression(C=100)
will generate what you are looking for.
As you have discovered, changing the value of the intercept_scaling parameter achieves a similar effect. The reason is again regularization, or rather how it affects estimation of the bias in the regression. A larger intercept_scaling will effectively reduce the impact of regularization on the bias.
For more information about the implementation of LR and solvers used by scikit-learn, check: http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
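The effect of C can be seen by fitting the question's data twice. This is a sketch, not a reproduction of the original run: it assumes solver="liblinear", which is the solver that also penalizes the intercept and to which intercept_scaling applies; other solvers and scikit-learn versions may give different numbers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Data from the question
age = np.array([13, 25, 40, 51, 55, 58]).reshape(-1, 1)
fan = np.array([0, 0, 1, 0, 1, 1])

# Assumption: liblinear solver, so the intercept is regularized as well
strong = LogisticRegression(C=1.0, solver="liblinear").fit(age, fan)
weak = LogisticRegression(C=100.0, solver="liblinear").fit(age, fan)

print("C=1:  ", strong.intercept_[0], strong.coef_[0, 0])
print("C=100:", weak.intercept_[0], weak.coef_[0, 0])
```

With weaker regularization (larger C) the fitted slope and intercept should move away from zero, towards the unregularized estimates that R's glm reports.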
