Returning standard deviation with `BaggingRegressor` - scikit-learn

Is there a way to return standard deviation using sklearn.ensemble.BaggingRegressor?
Cause by looking at several examples all that I have found has been the mean prediction.

You can always get the underlying predictions by each estimator of the ensemble, which (estimator) is accessible through the estimators_ attribute of the ensemble, and handle these predictions accordingly (compute mean, standard deviation, etc).
Adapting the example from the documentation, with an ensemble of 10 SVR base estimators:
import numpy as np
from sklearn.svm import SVR
from sklearn.ensemble import BaggingRegressor
from sklearn.datasets import make_regression
X, y = make_regression(n_samples=100, n_features=4,
n_informative=2, n_targets=1,
random_state=0, shuffle=False)
regr = BaggingRegressor(base_estimator=SVR(),
n_estimators=10, random_state=0).fit(X, y)
regr.predict([[0, 0, 0, 0]]) # get (mean) prediction for a single sample, [0, 0, 0, 0]
# array([-2.87202411])
# get the predictions from each individual member of the ensemble using a list comprehension:
raw_pred = [x.predict([[0, 0, 0, 0]]) for x in regr.estimators_]
raw_pred
# result:
[array([-2.13003431]),
array([-1.96224516]),
array([-1.90429596]),
array([-6.90647796]),
array([-6.21360547]),
array([-1.84318744]),
array([1.82285686]),
array([4.62508622]),
array([-5.60320499]),
array([-8.60513286])]
# get the mean, and ensure that it is the same with the one returned above with the .predict method of the ensemble:
np.mean(raw_pred)
# -2.8720241079257436
np.mean(raw_pred) == regr.predict([[0, 0, 0, 0]]) # sanity check
# True
# get the standard deviation:
np.std(raw_pred)
# 3.865135037828279

There is not a builtin way to do that, no.
The fitted estimators are available in the estimators_ attribute, along with estimators_features_ if you've set max_features < 1.0, so you can reproduce the individual predictions manually.

Related

RandomizedSearchCV sampling distribution

According to RandomizedSearchCV documentation (emphasis mine):
param_distributions: dict or list of dicts
Dictionary with parameters names (str) as keys and distributions or
lists of parameters to try. Distributions must provide a rvs method
for sampling (such as those from scipy.stats.distributions). If a list
is given, it is sampled uniformly. If a list of dicts is given, first
a dict is sampled uniformly, and then a parameter is sampled using
that dict as above.
If my understanding of the above is correct, both algorithms (XGBClassifier and LogisticRegression) in the following example should be sampled with high probability (>99%), given n_iter = 10.
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from xgboost.sklearn import XGBClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
param_grid = [
{'scaler': [StandardScaler()],
'feature_selection': [RFE(estimator=XGBClassifier(use_label_encoder=False, eval_metric='logloss'))],
'feature_selection__n_features_to_select': [3],
'classification': [XGBClassifier(use_label_encoder=False, eval_metric='logloss')],
'classification__n_estimators': [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000],
'classification__max_depth': [2, 5, 10],
},
{'scaler': [StandardScaler()],
'feature_selection': [RFE(estimator=LogisticRegression())],
'feature_selection__n_features_to_select': [3],
'classification': [LogisticRegression()],
'classification__C': [0.1],
},
]
pipe = Pipeline(steps=[('scaler', StandardScaler()), ('feature_selection', RFE(estimator=LogisticRegression())),
('classification', LogisticRegression())])
classifier = RandomizedSearchCV(estimator=pipe, param_distributions=param_grid,
scoring='neg_brier_score', n_jobs=-1, verbose=10)
data = load_breast_cancer()
X = data.data
y = data.target.ravel()
classifier.fit(X, y)
What happens though is that every time I run it XGBClassifier gets chosen 10/10 times. I would expect one candidate to come from Logistic Regresion since the probability for each dict to be sampled is 50-50.
If the search space between the two algoritms is more balanced ('classification__n_estimators': [100]) then the sampling works as expected.
Can someone clarify what's going on here?
Yes, this is incorrect behavior. There's an Issue filed: when all the entries are lists (none are scipy distributions), the current code selects points from the ParameterGrid, which means it will disproportionately choose points from the larger dictionary-grid from your list.
Until a fix gets merged, you might be able to work around this by using a scipy distribution for something you don't care about, say for verbose?

fit_transform vs transform when doing inference

I have trained a keras model and saved it. I now want to use the model in a web app for inference. I want to preprocess the inputs by scaling them using StandardScaler() from sklearn.
But whenever i run transform(inputs) an error occurs wanting me to do fitting first. This was the code
from sklearn.preprocessing import StandardScaler
inputs = [1,8,0,0,4,18,4,3,576,9,8,8,14,1,0,4,0,0,3,6,0,1,1]
inputs = scale.transform(inputs)
preds = model.predict(inputs, batch_size = 1)
I then changed the code inorder to do fitting
from sklearn.preprocessing import StandardScaler
inputs = [1,8,0,0,4,18,4,3,576,9,8,8,14,1,0,4,0,0,3,6,0,1,1]
inputs = scale.fit_transform(inputs)
preds = model.predict(inputs, batch_size = 1)
It worked but the scaled data are all bunch of zeros regardless of the inputs i provide, making wrong predicitions. Am certain am missing some key concepts here, i am asking for help. Thank you
The standard scaler function has formula:
z = (x - u) / s
Here,
x: Element
u: Mean
s: Standard Deviation
This element transformation is done column-wise.
Therefore, when you call to fit the values of mean and standard_deviation are calculated.
Eg:
from sklearn.preprocessing import StandardScaler
import numpy as np
x = np.random.randint(50,size = (10,2))
x
Output:
array([[26, 9],
[29, 39],
[23, 26],
[29, 22],
[28, 41],
[11, 6],
[42, 40],
[ 1, 25],
[ 0, 39],
[44, 45]])
Now, fitting the standard scaler
scale = StandardScaler()
scale.fit(x)
You can see the mean and standard deviation using the built methods for the StandardScaler object
# Mean
scale.mean_ # array([23.3, 29.2])
# Standard Deviation
scale.scale_ # array([14.36697602, 13.12859475])
You transform these values using the transform method.
scale.transform(x)
Output:
array([[ 0.18793099, -1.53862621],
[ 0.3967432 , 0.74646222],
[-0.02088122, -0.24374277],
[ 0.3967432 , -0.54842122],
[ 0.32713913, 0.89880145],
[-0.85613006, -1.76713506],
[ 1.3015961 , 0.82263184],
[-1.55217075, -0.31991238],
[-1.62177482, 0.74646222],
[ 1.44080424, 1.20347991]])
Calculation for 1st element:
z = (26 - 23.3) / 14.36697602
z = 0.18793099
How to use this?
The transformation should be done before training your model. The training should be done on transformed data. And for the prediction, the test data should use the same mean and standard deviation values as your training data. ie. Do not use fit method on the test data. You should use the object that was used to transform the training data to transform your test data.

how does sklearn compute the Accuracy score step by step?

I was reading about the metrics used in sklearn but I find pretty confused the following:
In the documentation sklearn provides a example of its usage as follows:
import numpy as np
from sklearn.metrics import accuracy_score
y_pred = [0, 2, 1, 3]
y_true = [0, 1, 2, 3]
accuracy_score(y_true, y_pred)
0.5
I understood that sklearns computes that metric as follows:
I am not sure about the process, I would like to appreciate if some one could explain more this result step by step since I was studying it but I found hard to understand, In order to understand more I tried the following case:
import numpy as np
from sklearn.metrics import accuracy_score
y_pred = [0, 2, 1, 3,0]
y_true = [0, 1, 2, 3,0]
print(accuracy_score(y_true, y_pred))
0.6
And I supposed that the correct computation would be the following:
but I am not sure about it, I would like to see if someone could support me with the computation rather than copy and paste the sklearn's documentation.
I have the doubt if the i in the sumatory is the same as the i in the formula inside the parenthesis, it is unclear to me, I don't know if the number of elements in the sumatory is related just to the number of elements in the sample of if it depends on also by the number of classes.
The indicator function equals one only if the variables in its arguments are equal, else it’s value is zero. Therefor when y is equal to yhat the indicator function produces a one counting as a correct classification. There is a code example in python and numerical example below.
import numpy as np
yhat=np.array([0,2,1,3])
y=np.array([0,1,2,3])
acc=np.mean(y==yhat)
print( acc)
example
A simple way to understand the calculation of the accuracy is:
Given two lists, y_pred and y_true, for every position index i, compare the i-th element of y_pred with the i-th element of y_true and perform the following calculation:
Count the number of matches
Divide it by the number of samples
So using your own example:
y_pred = [0, 2, 1, 3, 0]
y_true = [0, 1, 2, 3, 0]
We see matches on indices 0, 3 and 4. Thus:
number of matches = 3
number of samples = 5
Finally, the accuracy calculation:
accuracy = matches/samples
accuracy = 3/5
accuracy = 0.6
And for your question about the i index, it is the sample index, so it is the same for both the summation index and the Y/Yhat index.

How to plot accuracy bars for each feature of an array

I have a data set "x" and its label vector "y". I want to plot the accuracy for each attribute (for each column of "x") after applying NaiveBayes and cross-validation. I want a bar graph.
So at the end I need to have 3 bars, because "x" has 3 columns. And the classification has to run 3 times. 3 different accuracies for each feature.
Whenever I execute my code it shows:
ValueError: Found arrays with inconsistent numbers of samples: [1 3]
DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
What am I doing wrong?
import matplotlib.pyplot as plt
import numpy as np
from sklearn import cross_validation
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
x = np.array([[0, 0.51, 0.00101], [3, 0.54, 0.00105], [6, 0.57, 0.00108], [9, 0.60, 0.00111], [1, 0.73, 0.00114], [5, 0.76, 0.00117], [8, 0.89, 120]])
y = np.array([1, 0, 0, 1, 1, 1, 0])
scores = list()
scores_std = list()
for i in range(x.shape[1]):
xA=x[:, i]
scoresKF2 = cross_validation.cross_val_score(clf, xA, y, cv=2)
scores.append(np.mean(scoresKF2))
scores_std.append(np.std(scoresKF2))
plt.bar(x[:,i], scores)
plt.show()
Checking the shape of your input data, xA, shows us that it is 1-dimensional -- specifically, it is (7,) shape. As the warning tells us, you are not allowed to pass in a 1d array here. The key to solving this in the warning that was returned Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample. Therefore, since it is just a single feature, do this xA = x[:,i].reshape(-1, 1) instead of xA = x[:,i].
I think there is another issue with the plotting. I'm not completely sure what you are expecting to see but you should probably replace plt.bar(x[:,i], scores) with plt.bar(i, np.mean(scoresKF2)).

Scikit Learn Logistic Regression confusion

I'm having some trouble understanding sckit-learn's LogisticRegression() method. Here's a simple example
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
# Create a sample dataframe
data = [['Age', 'ZepplinFan'], [13, 0], [25, 0], [40, 1], [51, 0], [55, 1], [58, 1]]
columns=data.pop(0)
df = pd.DataFrame(data=data, columns=columns)
Age ZepplinFan
0 13 0
1 25 0
2 40 1
3 51 0
4 55 1
5 58 1
# Fit Logistic Regression
lr = LogisticRegression()
lr.fit(X=df[['Age']], y = df['ZepplinFan'])
# View the coefficients
lr.intercept_ # returns -0.56333276
lr.coef_ # returns 0.02368826
# Predict for new values
xvals = np.arange(-10,70,1)
predictions = lr.predict_proba(X=xvals[:,np.newaxis])
probs = [y for [x, y] in predictions]
# Plot the fitted model
plt.plot(xvals, probs)
plt.scatter(df.Age.values, df.ZepplinFan.values)
plt.show()
Obviously this doesn't appear to be a good fit. Furthermore, when I do this exercise in R I get different coefficients and a model that makes more sense.
lapply(c("data.table","ggplot2"), require, character.only=T)
dt <- data.table(Age=c(13, 25, 40, 51, 55, 58), ZepplinFan=c(0, 0, 1, 0, 1, 1))
mylogit <- glm(ZepplinFan ~ Age, data = dt, family = "binomial")
newdata <- data.table(Age=seq(10,70,1))
newdata[, ZepplinFan:=predict(mylogit, newdata=newdata, type="response")]
mylogit$coeff
(Intercept) Age
-4.8434 0.1148
ggplot()+geom_point(data=dt, aes(x=Age, y=ZepplinFan))+geom_line(data=newdata, aes(x=Age, y=ZepplinFan))
What am I missing here?
The problem you are facing is related to the fact that scikit learn is using regularized logistic regression. The regularization term allows for controlling the trade-off between the fit to the data and generalization to future unknown data. The parameter C is used to control the regularization, in your case:
lr = LogisticRegression(C=100)
will generate what you are looking for:
As you have discovered, changing the value of the intercept_scaling parameter also achieves similar effect. The reason is also regularization or rather how it affects estimation of the bias in the regression. The larger intercept_scaling parameter will effectively reduce the impact of regularization on the bias.
For more information about the implementation of LR and solvers used by scikit-learn, check: http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

Resources