What exactly does .transform() do in sklearn's StandardScaler?

I am studying StandardScaler right now, but I do not understand what .transform exactly does. The API documentation just says "Perform standardization by centering and scaling". Can someone explain?

The standard scaler applies the formula:
z = (x - u) / s
Here,
x: Element
u: Mean
s: Standard Deviation
This transformation is applied column-wise.
Therefore, when you call fit, the mean and standard deviation of each column are calculated. For example:
from sklearn.preprocessing import StandardScaler
import numpy as np
x = np.random.randint(50, size=(10, 2))
x
Output:
array([[26,  9],
       [29, 39],
       [23, 26],
       [29, 22],
       [28, 41],
       [11,  6],
       [42, 40],
       [ 1, 25],
       [ 0, 39],
       [44, 45]])
Now, fit the standard scaler:
scale = StandardScaler()
scale.fit(x)
You can see the mean and standard deviation using the fitted attributes of the StandardScaler object:
# Mean
scale.mean_ # array([23.3, 29.2])
# Standard Deviation
scale.scale_ # array([14.36697602, 13.12859475])
You then standardize the values using the transform method:
scale.transform(x)
Output:
array([[ 0.18793099, -1.53862621],
       [ 0.3967432 ,  0.74646222],
       [-0.02088122, -0.24374277],
       [ 0.3967432 , -0.54842122],
       [ 0.32713913,  0.89880145],
       [-0.85613006, -1.76713506],
       [ 1.3015961 ,  0.82263184],
       [-1.55217075, -0.31991238],
       [-1.62177482,  0.74646222],
       [ 1.44080424,  1.20347991]])
Calculation for the first element:
z = (26 - 23.3) / 14.36697602
z = 0.18793099

If you click on [source] on the right side of the documentation, you can see the source code. From lines 796 to 807 you'll see:
if sparse.issparse(X):
    if self.with_mean:
        raise ValueError(
            "Cannot center sparse matrices: pass `with_mean=False` "
            "instead. See docstring for motivation and alternatives.")
    if self.scale_ is not None:
        inplace_column_scale(X, 1 / self.scale_)
else:
    if self.with_mean:
        X -= self.mean_
    if self.with_std:
        X /= self.scale_
You can see that it performs standardization in the if-else block.
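As a quick check, the same standardization can be reproduced by hand with NumPy. This is a minimal sketch reusing the x array and the fitted scale object from the example above; note that StandardScaler uses the population standard deviation (ddof=0):
import numpy as np
# mean_ and scale_ are simply the column-wise mean and population standard deviation
assert np.allclose(scale.mean_, x.mean(axis=0))
assert np.allclose(scale.scale_, x.std(axis=0))  # NumPy's default ddof=0 matches
# transform() applies z = (x - u) / s to every column
z_manual = (x - scale.mean_) / scale.scale_
assert np.allclose(z_manual, scale.transform(x))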

Related

What is the role of using the OneVsRestClassifier wrapper around XGBClassifier?

I have a multiclass classification problem with 3 classes.
0 - on a given day (24h) my laptop battery did not die
1 - on a given day my laptop battery died before noon
2 - on a given day my laptop battery died at or after noon
(Note that these categories are mutually exclusive. The battery is not recharged once it has died.)
I am interested in the predicted probability for each of the 3 classes. More specifically, I intend to derive 2 types of warning:
If the prediction for class 1 is higher than a threshold x: 'Your battery is at risk of dying in the morning.'
If the prediction for class 2 is higher than a threshold y: 'Your battery is at risk of dying in the afternoon.'
I can generate the probabilities by using xgboost.XGBClassifier with the appropriate parameters for a multiclass problem.
import numpy as np
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from xgboost import XGBClassifier
X = np.array([
    [10, 10],
    [8, 10],
    [-5, 5.5],
    [-5.4, 5.5],
    [-20, -20],
    [-15, -20]
])
y = np.array([0, 1, 1, 1, 2, 2])
clf1 = XGBClassifier(objective='multi:softprob', num_class=3, seed=42)
clf1.fit(X, y)
clf1.predict_proba([[-19, -20]])
Results:
array([[0.15134096, 0.3304505 , 0.51820856]], dtype=float32)
But I can also wrap this with sklearn.multiclass.OneVsRestClassifier, which then produces slightly different results:
clf2 = OneVsRestClassifier(XGBClassifier(objective='multi:softprob', num_class=3, seed=42))
clf2.fit(X, y)
clf2.predict_proba([[-19, -20]])
Results:
array([[0.10356173, 0.34510303, 0.5513352 ]], dtype=float32)
I was expecting the two approaches to produce the same results. My understanding was that XGBClassifier also uses a one-vs-rest approach in the multiclass case, since there are 3 probabilities in the output and they sum to 1.
Can you tell me where the difference comes from, and how the respective results should be interpreted? And most importantly, which approach is better suited to solving my problem?

fit_transform vs transform when doing inference

I have trained a Keras model and saved it. I now want to use the model in a web app for inference. I want to preprocess the inputs by scaling them using StandardScaler() from sklearn.
But whenever I run transform(inputs), an error occurs telling me to do the fitting first. This was the code:
from sklearn.preprocessing import StandardScaler
inputs = [1,8,0,0,4,18,4,3,576,9,8,8,14,1,0,4,0,0,3,6,0,1,1]
inputs = scale.transform(inputs)
preds = model.predict(inputs, batch_size = 1)
I then changed the code in order to do the fitting:
from sklearn.preprocessing import StandardScaler
inputs = [1,8,0,0,4,18,4,3,576,9,8,8,14,1,0,4,0,0,3,6,0,1,1]
inputs = scale.fit_transform(inputs)
preds = model.predict(inputs, batch_size = 1)
It worked, but the scaled data are all zeros regardless of the inputs I provide, leading to wrong predictions. I am certain I am missing some key concepts here, so I am asking for help. Thank you.
The standard scaler formula and a worked example are given in the answer to the question above, so they are not repeated here.
How to use this?
The transformation should be done before training your model, and the training should be done on the transformed data. For prediction, the test data must be scaled with the same mean and standard deviation values as your training data, i.e. do not call fit on the test data; use the same scaler object that was fitted on the training data to transform it. That is also why your second snippet returned all zeros: fit_transform refitted the scaler on a single sample, so each feature's mean is the input value itself and every standardized value becomes zero.
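A minimal sketch of that workflow, assuming a hypothetical scaler.joblib file name and placeholder X_train and model objects:
import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# At training time: fit the scaler on the training data only, then persist it
scale = StandardScaler()
X_train_scaled = scale.fit_transform(X_train)  # X_train is your training matrix
joblib.dump(scale, "scaler.joblib")

# At inference time (e.g. in the web app): load the fitted scaler, only call transform()
scale = joblib.load("scaler.joblib")
inputs = np.array([[1, 8, 0, 0, 4, 18, 4, 3, 576, 9, 8, 8, 14, 1, 0, 4, 0, 0, 3, 6, 0, 1, 1]])  # note the 2D shape: one sample
inputs_scaled = scale.transform(inputs)
preds = model.predict(inputs_scaled, batch_size=1)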

How to plot accuracy bars for each feature of an array

I have a data set "x" and its label vector "y". I want to plot the accuracy for each attribute (for each column of "x") after applying NaiveBayes and cross-validation. I want a bar graph.
So at the end I need 3 bars, because "x" has 3 columns, and the classification has to run 3 times: one accuracy per feature.
Whenever I execute my code it shows:
ValueError: Found arrays with inconsistent numbers of samples: [1 3]
DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
What am I doing wrong?
import matplotlib.pyplot as plt
import numpy as np
from sklearn import cross_validation
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
x = np.array([[0, 0.51, 0.00101], [3, 0.54, 0.00105], [6, 0.57, 0.00108], [9, 0.60, 0.00111], [1, 0.73, 0.00114], [5, 0.76, 0.00117], [8, 0.89, 120]])
y = np.array([1, 0, 0, 1, 1, 1, 0])
scores = list()
scores_std = list()
for i in range(x.shape[1]):
    xA = x[:, i]
    scoresKF2 = cross_validation.cross_val_score(clf, xA, y, cv=2)
    scores.append(np.mean(scoresKF2))
    scores_std.append(np.std(scoresKF2))
    plt.bar(x[:, i], scores)
plt.show()
Checking the shape of your input data, xA, shows us that it is 1-dimensional -- specifically, its shape is (7,). As the warning tells us, you are not allowed to pass in a 1d array here. The key to solving this is in the warning that was returned: Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample. Therefore, since it is just a single feature, do xA = x[:, i].reshape(-1, 1) instead of xA = x[:, i].
I think there is another issue with the plotting. I'm not completely sure what you are expecting to see, but you should probably replace plt.bar(x[:,i], scores) with plt.bar(i, np.mean(scoresKF2)).
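Putting both fixes together, a minimal corrected sketch of the script (note that the sklearn.cross_validation module has since been replaced by sklearn.model_selection in newer scikit-learn releases):
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

clf = GaussianNB()
x = np.array([[0, 0.51, 0.00101], [3, 0.54, 0.00105], [6, 0.57, 0.00108],
              [9, 0.60, 0.00111], [1, 0.73, 0.00114], [5, 0.76, 0.00117],
              [8, 0.89, 120]])
y = np.array([1, 0, 0, 1, 1, 1, 0])

for i in range(x.shape[1]):
    xA = x[:, i].reshape(-1, 1)  # single feature, so reshape to a 2D column
    scoresKF2 = cross_val_score(clf, xA, y, cv=2)
    plt.bar(i, np.mean(scoresKF2), yerr=np.std(scoresKF2))  # one bar per feature

plt.xticks(range(x.shape[1]))
plt.xlabel('feature index')
plt.ylabel('cross-validated accuracy')
plt.show()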

How to project a new point to new basis using 'components_' attribute of PCA from sklearn.decomposition package?

I have some data points with 3 coordinates, and using the PCA function I converted them into points having 2 coordinates by doing this:
import numpy as np
from sklearn.decomposition import PCA
X = np.array([[-1, -1, -3], [-2, -1, -1], [-3, -2, -2], [1, 1, 1], [2, 1, 5], [3, 2, 6]]) #data
pca = PCA(n_components=2)
pca.fit(X)
PCA(copy=True, n_components=2, whiten=False)
XT = pca.fit_transform(X)
print(XT)
#output obtained
#[[-4.04510516 -1.24556106]
#[-2.92607624 0.61239898]
#[-4.55000611 1.13825234]
#[ 0.81687144 -1.11632484]
#[ 4.5401931 0.56854397]
#[ 6.16412297 0.04269061]]
The I got principal axes in feature space, representing the directions of maximum variance in the data using 'components_' attribute
W = (pca.components_)
print(W)
# output obtained
#[[ 0.49508794 0.3217835 0.80705843]
# [-0.67701709 -0.43930775 0.59047148]]
Now I wanted to project the first point [-1, -1, -3] (the first point in X) onto the 2D subspace using the 'components_' attribute by doing this:
projectedXT_0 = np.dot(W,X[0])
print(projectedXT_0)
#output obtained
#[-3.23804673 -0.65508959]
#expected output
#[-4.04510516 -1.24556106]
I am not getting what I expected, so obviously I am doing something wrong when calculating the projected point using the 'components_' attribute. Kindly demonstrate the use of the 'components_' attribute to get the projection of a point.
NOTE: I know the 'transform' function does this, but I want to do it using the 'components_' attribute.
You forgot to subtract the mean.
See the source of PCA's transform method:
if self.mean_ is not None:
    X = X - self.mean_
X_transformed = fast_dot(X, self.components_.T)
if self.whiten:
    X_transformed /= np.sqrt(self.explained_variance_)
return X_transformed
see Projecting new samples into existing PCA space?
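Putting it together, a minimal sketch reusing the fitted pca, W, and X from the question (pca.mean_ holds the per-feature mean learned during fit):
import numpy as np
# Center the point with the training mean, then project onto the principal axes
projectedXT_0 = np.dot(W, X[0] - pca.mean_)
print(projectedXT_0)  # now matches pca.transform(X)[0]
# The same for all points at once:
assert np.allclose(np.dot(X - pca.mean_, pca.components_.T), pca.transform(X))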
It is pca.transform(...).
Also the two lines you have:
pca.fit(X)
PCA(copy=True, n_components=2, whiten=False)
don't do anything useful on their own; you should use fit_transform() instead.
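For instance, a minimal sketch of the condensed version:
pca = PCA(n_components=2)
XT = pca.fit_transform(X)  # fits the model and projects X in one step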

Scikit Learn Logistic Regression confusion

I'm having some trouble understanding scikit-learn's LogisticRegression() method. Here's a simple example:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
# Create a sample dataframe
data = [['Age', 'ZepplinFan'], [13, 0], [25, 0], [40, 1], [51, 0], [55, 1], [58, 1]]
columns=data.pop(0)
df = pd.DataFrame(data=data, columns=columns)
   Age  ZepplinFan
0   13           0
1   25           0
2   40           1
3   51           0
4   55           1
5   58           1
# Fit Logistic Regression
lr = LogisticRegression()
lr.fit(X=df[['Age']], y = df['ZepplinFan'])
# View the coefficients
lr.intercept_ # returns -0.56333276
lr.coef_ # returns 0.02368826
# Predict for new values
xvals = np.arange(-10,70,1)
predictions = lr.predict_proba(X=xvals[:,np.newaxis])
probs = [y for [x, y] in predictions]
# Plot the fitted model
plt.plot(xvals, probs)
plt.scatter(df.Age.values, df.ZepplinFan.values)
plt.show()
Obviously this doesn't appear to be a good fit. Furthermore, when I do this exercise in R I get different coefficients and a model that makes more sense.
lapply(c("data.table","ggplot2"), require, character.only=T)
dt <- data.table(Age=c(13, 25, 40, 51, 55, 58), ZepplinFan=c(0, 0, 1, 0, 1, 1))
mylogit <- glm(ZepplinFan ~ Age, data = dt, family = "binomial")
newdata <- data.table(Age=seq(10,70,1))
newdata[, ZepplinFan:=predict(mylogit, newdata=newdata, type="response")]
mylogit$coeff
(Intercept) Age
-4.8434 0.1148
ggplot()+geom_point(data=dt, aes(x=Age, y=ZepplinFan))+geom_line(data=newdata, aes(x=Age, y=ZepplinFan))
What am I missing here?
The problem you are facing is related to the fact that scikit-learn uses regularized logistic regression by default. The regularization term controls the trade-off between fitting the training data and generalizing to future unknown data. The parameter C controls the strength of the regularization; in your case:
lr = LogisticRegression(C=100)
will produce a fit much closer to what you are looking for.
As you have discovered, changing the value of the intercept_scaling parameter achieves a similar effect. The reason is again regularization, or rather how it affects the estimation of the bias term in the regression. A larger intercept_scaling value effectively reduces the impact of regularization on the bias.
For more information about the implementation of LR and solvers used by scikit-learn, check: http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
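A minimal self-contained sketch of the suggested change, reusing the data from the question (with a large C the regularization becomes weak, so the coefficients should move toward R's unregularized glm estimates of roughly -4.84 and 0.115):
import pandas as pd
from sklearn.linear_model import LogisticRegression

data = [['Age', 'ZepplinFan'], [13, 0], [25, 0], [40, 1], [51, 0], [55, 1], [58, 1]]
columns = data.pop(0)
df = pd.DataFrame(data=data, columns=columns)

# Larger C = weaker regularization = closer to a plain maximum-likelihood fit
lr = LogisticRegression(C=100)
lr.fit(X=df[['Age']], y=df['ZepplinFan'])
print(lr.intercept_, lr.coef_)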
