PCA for features selection - scikit-learn

i'm dealing with Isolation Forest for APT classification and it gets encouraging results.
Now i'm trying to improve the performance with feature selection, and i consider PCA approach.
df = pd.read_csv("features_labeled_PULITOIsolationFor.csv")
apt_list = list(set(df.apt))[1:] ## Il primo รจ nan quindi lo scarto
for apt_name in apt_list:
out_file = open("test.txt","a")
out_file.write(apt_name + "\n" + "\n")
out_file.close()
df_apt17 = df[df["apt"] == apt_name]
df_other = df[df["apt"] != apt_name]
nf = 150
pca = PCA(n_components=nf)
df_apt17 = pca.fit_transform(df_apt17.drop("apt", 1))
print(df_apt17.shape)
kf = KFold(n_splits=10, random_state=1, shuffle=True)
df_apt17 = df_apt17.reset_index()
df_other = df_other.reset_index()
After this, i perform Cross Validation dividing the apt in question in Kfolder and i use df_other dataframe for testing folder merged with some elements of current APT.
However, although PCA seems to work since the reduction of features (seen by .shape on dataframe), it gives me error on the reset_index() function:
df_apt17 = df_apt17.reset_index()
AttributeError: 'numpy.ndarray' object has no attribute 'reset_index'
how can i deal with this problem?
Thanks to everyone

By reading the docs in sklearn (http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) it say that the method pca.fit_transform() return an array.
Arrays don't have reset_index() method, only pandas.DataFrame.
fit_transform(X, y=None)
Fit the model with X and apply the dimensionality reduction on X.
Parameters: X : array-like, shape (n_samples, n_features)
Training data, where n_samples is the number of samples and n_features
is the number of features.
y : Ignored.
Returns: X_new : array-like, shape (n_samples, n_components)
If you need to use reset_index(), you need transform it back to a pandas.DataFrame.

Related

Find wrongly categorized samples from validation step

I am using a keras neural net for identifying category in which the data belongs.
self.model.compile(loss='categorical_crossentropy',
optimizer=keras.optimizers.Adam(lr=0.001, decay=0.0001),
metrics=[categorical_accuracy])
Fit function
history = self.model.fit(self.X,
{'output': self.Y},
validation_split=0.3,
epochs=400,
batch_size=32
)
I am interested in finding out which labels are getting categorized wrongly in the validation step. Seems like a good way to understand what is happening under the hood.
You can use model.predict_classes(validation_data) to get the predicted classes for your validation data, and compare these predictions with the actual labels to find out where the model was wrong. Something like this:
predictions = model.predict_classes(validation_data)
wrong = np.where(predictions != Y_validation)
If you are interested in looking 'under the hood', I'd suggest to use
model.predict(validation_data_x)
to see the scores for each class, for each observation of the validation set.
This should shed some light on which categories the model is not so good at classifying. The way to predict the final class is
scores = model.predict(validation_data_x)
preds = np.argmax(scores, axis=1)
be sure to use the proper axis for np.argmax (I'm assuming your observation axis is 1). Use preds to then compare with the real class.
Also, as another exploration you want to see the overall accuracy on this dataset, use
model.evaluate(x=validation_data_x, y=validation_data_y)
I ended up creating a metric which prints the "worst performing category id + score" on each iteration. Ideas from link
import tensorflow as tf
import numpy as np
class MaxIoU(object):
def __init__(self, num_classes):
super().__init__()
self.num_classes = num_classes
def max_iou(self, y_true, y_pred):
# Wraps np_max_iou method and uses it as a TensorFlow op.
# Takes numpy arrays as its arguments and returns numpy arrays as
# its outputs.
return tf.py_func(self.np_max_iou, [y_true, y_pred], tf.float32)
def np_max_iou(self, y_true, y_pred):
# Compute the confusion matrix to get the number of true positives,
# false positives, and false negatives
# Convert predictions and target from categorical to integer format
target = np.argmax(y_true, axis=-1).ravel()
predicted = np.argmax(y_pred, axis=-1).ravel()
# Trick from torchnet for bincounting 2 arrays together
# https://github.com/pytorch/tnt/blob/master/torchnet/meter/confusionmeter.py
x = predicted + self.num_classes * target
bincount_2d = np.bincount(x.astype(np.int32), minlength=self.num_classes**2)
assert bincount_2d.size == self.num_classes**2
conf = bincount_2d.reshape((self.num_classes, self.num_classes))
# Compute the IoU and mean IoU from the confusion matrix
true_positive = np.diag(conf)
false_positive = np.sum(conf, 0) - true_positive
false_negative = np.sum(conf, 1) - true_positive
# Just in case we get a division by 0, ignore/hide the error and set the value to 0
with np.errstate(divide='ignore', invalid='ignore'):
iou = false_positive / (true_positive + false_positive + false_negative)
iou[np.isnan(iou)] = 0
return np.max(iou).astype(np.float32) + np.argmax(iou).astype(np.float32)
~
usage:
custom_metric = MaxIoU(len(catagories))
self.model.compile(loss='categorical_crossentropy',
optimizer=keras.optimizers.Adam(lr=0.001, decay=0.0001),
metrics=[categorical_accuracy, custom_metric.max_iou])

plot_decision_regions with error "Filler values must be provided when X has more than 2 training features."

I am plotting 2D plot for SVC Bernoulli output.
converted to vectors from Avg word2vec and standerdised data
split data to train and test.
Through grid search found the best C and gamma(rbf)
clf = SVC(C=100,gamma=0.0001)
clf.fit(X_train1,y_train)
from mlxtend.plotting import plot_decision_regions
plot_decision_regions(X_train, y_train, clf=clf, legend=2)
plt.xlabel(X.columns[0], size=14)
plt.ylabel(X.columns[1], size=14)
plt.title('SVM Decision Region Boundary', size=16)
Receive error :-
ValueError: y must be a NumPy array. Found
also tried to convert the y to numpy. Then it prompts error
ValueError: y must be an integer array. Found object. Try passing the array as y.astype(np.integer)
finally i converted it to integer array.
Now it is prompting of error.
ValueError: Filler values must be provided when X has more than 2 training features.
You can use PCA to reduce your data multi-dimensional data to two dimensional data. Then pass the obtained result in plot_decision_region and there will be no need of filler values.
from sklearn.decomposition import PCA
from mlxtend.plotting import plot_decision_regions
clf = SVC(C=100,gamma=0.0001)
pca = PCA(n_components = 2)
X_train2 = pca.fit_transform(X_train)
clf.fit(X_train2, y_train)
plot_decision_regions(X_train2, y_train, clf=clf, legend=2)
plt.xlabel(X.columns[0], size=14)
plt.ylabel(X.columns[1], size=14)
plt.title('SVM Decision Region Boundary', size=16)
I've spent some time with this too as plot_decision_regions was then complaining ValueError: Column(s) [2] need to be accounted for in either feature_index or filler_feature_values and there's one more parameter needed to avoid this.
So, say, you have 4 features and they come unnamed:
X_train_std.shape[1] = 4
We can refer to each feature by their index 0, 1, 2, 3. You only can plot 2 features at a time, say you want 0 and 2.
You'll need to specify one additional parameter (to those specified in #sos.cott's answer), feature_index, and fill the rest with fillers:
value=1.5
width=0.75
fig = plot_decision_regions(X_train.values, y_train.values, clf=clf,
feature_index=[0,2], #these one will be plotted
filler_feature_values={1: value, 3:value}, #these will be ignored
filler_feature_ranges={1: width, 3: width})
You can just do (Assuming X_train and y_train are still panda dataframes) for the numpy array problem.
plot_decision_regions(X_train.values, y_train.values, clf=clf, legend=2)
For the filler_feature issue, you have to specify the number of features so you do the following:
value=1.5
width=0.75
fig = plot_decision_regions(X_train.values, y_train.values, clf=clf,
filler_feature_values={2: value, 3:value, 4:value},
filler_feature_ranges={2: width, 3: width, 4:width},
legend=2, ax=ax)
You need to add one filler feature for each feature you have.

PySpark LinearRegressionWithSGD, model predict dimensions mismatch

I've come across the following error:
AssertionError: dimension mismatch
I've trained a linear regression model using PySpark's LinearRegressionWithSGD.
However when I try to make a prediction on the training set, I get "dimension mismatch" error.
Worth mentioning:
Data was scaled using StandardScaler, but the predicted value was not.
As can be seen in code the features used for training were generated by PCA.
Some code:
pca_transformed = pca_model.transform(data_std)
X = pca_transformed.map(lambda x: (x[0], x[1]))
data = train_votes.zip(pca_transformed)
labeled_data = data.map(lambda x : LabeledPoint(x[0], x[1:]))
linear_regression_model = LinearRegressionWithSGD.train(labeled_data, iterations=10)
The prediction is the source of the error, and these are the variations I tried:
pred = linear_regression_model.predict(pca_transformed.collect())
pred = linear_regression_model.predict([pca_transformed.collect()])
pred = linear_regression_model.predict(X.collect())
pred = linear_regression_model.predict([X.collect()])
The regression weights:
DenseVector([1.8509, 81435.7615])
The vectors used:
pca_transformed.take(1)
[DenseVector([-0.1745, -1.8936])]
X.take(1)
[(-0.17449817243564397, -1.8935926689554488)]
labeled_data.take(1)
[LabeledPoint(22221.0, [-0.174498172436,-1.89359266896])]
This worked:
pred = linear_regression_model.predict(pca_transformed)
pca_transformed is of type RDD.
The function handles RDD's and arrays differently:
def predict(self, x):
"""
Predict the value of the dependent variable given a vector or
an RDD of vectors containing values for the independent variables.
"""
if isinstance(x, RDD):
return x.map(self.predict)
x = _convert_to_vector(x)
return self.weights.dot(x) + self.intercept
When a simple array is used, there might be a dimension mismatch issue (like the error in the question above).
As can be seen, if x is not an RDD, it's being converted to a vector. The thing is the dot product will not work unless you take x[0].
Here is the error reproduced:
j = _convert_to_vector(pca_transformed.take(1))
linear_regression_model.weights.dot(j) + linear_regression_model.intercept
This works just fine:
j = _convert_to_vector(pca_transformed.take(1))
linear_regression_model.weights.dot(j[0]) + linear_regression_model.intercept

Role of class_weight in loss functions for linearSVC and LogisticRegression

I am trying to figure out what exactly the loss function formula is and how I can manually calculate it when class_weight='auto' in case of svm.svc, svm.linearSVC and linear_model.LogisticRegression.
For balanced data, say you have a trained classifier: clf_c. Logistic loss should be (am I correct?):
def logistic_loss(x,y,w,b,b0):
'''
x: nxp data matrix where n is number of data points and p is number of features.
y: nx1 vector of true labels (-1 or 1).
w: nx1 vector of weights (vector of 1./n for balanced data).
b: px1 vector of feature weights.
b0: intercept.
'''
s = y
if 0 in np.unique(y):
print 'yes'
s = 2. * y - 1
l = np.dot(w, np.log(1 + np.exp(-s * (np.dot(x, np.squeeze(b)) + b0))))
return l
I realized that logisticRegression has predict_log_proba() which gives you exactly that when data is balanced:
b, b0 = clf_c.coef_, clf_c.intercept_
w = np.ones(len(y))/len(y)
-(clf_c.predict_log_proba(x[xrange(len(x)), np.floor((y+1)/2).astype(np.int8)]).mean() == logistic_loss(x,y,w,b,b0)
Note, np.floor((y+1)/2).astype(np.int8) simply maps y=(-1,1) to y=(0,1).
But this does not work when data is imbalanced.
What's more, you expect the classifier (here, logisticRegression) to perform similarly (in terms of loss function value) when data in balance and class_weight=None versus when data is imbalanced and class_weight='auto'. I need to have a way to calculate the loss function (without the regularization term) for both scenarios and compare them.
In short, what does class_weight = 'auto' exactly mean? Does it mean class_weight = {-1 : (y==1).sum()/(y==-1).sum() , 1 : 1.} or rather class_weight = {-1 : 1./(y==-1).sum() , 1 : 1./(y==1).sum()}?
Any help is much much appreciated. I tried going through the source code, but I am not a programmer and I am stuck.
Thanks a lot in advance.
class_weight heuristics
I am a bit puzzled by your first proposition for the class_weight='auto' heuristic, as:
class_weight = {-1 : (y == 1).sum() / (y == -1).sum(),
1 : 1.}
is the same as your second proposition if we normalize it so that the weights sum to one.
Anyway to understand what class_weight="auto" does, see this question:
what is the difference between class weight = none and auto in svm scikit learn.
I am copying it here for later comparison:
This means that each class you have (in classes) gets a weight equal
to 1 divided by the number of times that class appears in your data
(y), so classes that appear more often will get lower weights. This is
then further divided by the mean of all the inverse class frequencies.
Note how this is not completely obvious ;).
This heuristic is deprecated and will be removed in 0.18. It will be replaced by another heuristic, class_weight='balanced'.
The 'balanced' heuristic weighs classes proportionally to the inverse of their frequency.
From the docs:
The "balanced" mode uses the values of y to automatically adjust
weights inversely proportional to class frequencies in the input data:
n_samples / (n_classes * np.bincount(y)).
np.bincount(y) is an array with the element i being the count of class i samples.
Here's a bit of code to compare the two:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.utils import compute_class_weight
n_classes = 3
n_samples = 1000
X, y = make_classification(n_samples=n_samples, n_features=20, n_informative=10,
n_classes=n_classes, weights=[0.05, 0.4, 0.55])
print("Count of samples per class: ", np.bincount(y))
balanced_weights = n_samples /(n_classes * np.bincount(y))
# Equivalent to the following, using version 0.17+:
# compute_class_weight("balanced", [0, 1, 2], y)
print("Balanced weights: ", balanced_weights)
print("'auto' weights: ", compute_class_weight("auto", [0, 1, 2], y))
Output:
Count of samples per class: [ 57 396 547]
Balanced weights: [ 5.84795322 0.84175084 0.60938452]
'auto' weights: [ 2.40356854 0.3459682 0.25046327]
The loss functions
Now the real question is: how are these weights used to train the classifier?
I don't have a thorough answer here unfortunately.
For SVC and linearSVC the docstring is pretty clear
Set the parameter C of class i to class_weight[i]*C for SVC.
So high weights mean less regularization for the class and a higher incentive for the svm to classify it properly.
I do not know how they work with logistic regression. I'll try to look into it but most of the code is in liblinear or libsvm and I'm not too familiar with those.
However, note that the weights in class_weight do not influence directly methods such as predict_proba. They change its ouput because the classifier optimizes a different loss function.
Not sure this is clear, so here's a snippet to explain what I mean (you need to run the first one for the imports and variable definition):
lr = LogisticRegression(class_weight="auto")
lr.fit(X, y)
# We get some probabilities...
print(lr.predict_proba(X))
new_lr = LogisticRegression(class_weight={0: 100, 1: 1, 2: 1})
new_lr.fit(X, y)
# We get different probabilities...
print(new_lr.predict_proba(X))
# Let's cheat a bit and hand-modify our new classifier.
new_lr.intercept_ = lr.intercept_.copy()
new_lr.coef_ = lr.coef_.copy()
# Now we get the SAME probabilities.
np.testing.assert_array_equal(new_lr.predict_proba(X), lr.predict_proba(X))
Hope this helps.

scikit learn: how to check coefficients significance

i tried to do a LR with SKLearn for a rather large dataset with ~600 dummy and only few interval variables (and 300 K lines in my dataset) and the resulting confusion matrix looks suspicious. I wanted to check the significance of the returned coefficients and ANOVA but I cannot find how to access it. Is it possible at all? And what is the best strategy for data that contains lots of dummy variables? Thanks a lot!
Scikit-learn deliberately does not support statistical inference. If you want out-of-the-box coefficients significance tests (and much more), you can use Logit estimator from Statsmodels. This package mimics interface glm models in R, so you could find it familiar.
If you still want to stick to scikit-learn LogisticRegression, you can use asymtotic approximation to distribution of maximum likelihiood estimates. Precisely, for a vector of maximum likelihood estimates theta, its variance-covariance matrix can be estimated as inverse(H), where H is the Hessian matrix of log-likelihood at theta. This is exactly what the function below does:
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression
def logit_pvalue(model, x):
""" Calculate z-scores for scikit-learn LogisticRegression.
parameters:
model: fitted sklearn.linear_model.LogisticRegression with intercept and large C
x: matrix on which the model was fit
This function uses asymtptics for maximum likelihood estimates.
"""
p = model.predict_proba(x)
n = len(p)
m = len(model.coef_[0]) + 1
coefs = np.concatenate([model.intercept_, model.coef_[0]])
x_full = np.matrix(np.insert(np.array(x), 0, 1, axis = 1))
ans = np.zeros((m, m))
for i in range(n):
ans = ans + np.dot(np.transpose(x_full[i, :]), x_full[i, :]) * p[i,1] * p[i, 0]
vcov = np.linalg.inv(np.matrix(ans))
se = np.sqrt(np.diag(vcov))
t = coefs/se
p = (1 - norm.cdf(abs(t))) * 2
return p
# test p-values
x = np.arange(10)[:, np.newaxis]
y = np.array([0,0,0,1,0,0,1,1,1,1])
model = LogisticRegression(C=1e30).fit(x, y)
print(logit_pvalue(model, x))
# compare with statsmodels
import statsmodels.api as sm
sm_model = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
print(sm_model.pvalues)
sm_model.summary()
The outputs of print() are identical, and they happen to be coefficient p-values.
[ 0.11413093 0.08779978]
[ 0.11413093 0.08779979]
sm_model.summary() also prints a nicely formatted HTML summary.

Resources