Sklearn incorrect support value (number of samples in each class) for classification report - scikit-learn

I am fitting an SVM to some data using sklearn. I have 24 samples in total (10 negative, 14 positive).
from sklearn import svm
from sklearn.model_selection import train_test_split

# Set up the model
clf = svm.SVC(kernel='linear', C=1)
# Create train/test splits, stratified by y, and fit the SVM
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.3, stratify=y)
clf.fit(X_train, y_train)
# Make predictions
y_pred = clf.predict(X_test)
I stratified by y to ensure both classes are represented in proportion in my test set, which seems to have worked. However, the classification report (shown as an image in the original post) says there are no negative samples.

The signature for classification_report is (y_true, y_pred, ...); you've reversed the inputs.
This is one of the places where using explicit keyword arguments is good practice.
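For example, a minimal sketch using the variables from the question:

from sklearn.metrics import classification_report

# Ground truth comes first; keywords make the order mistake impossible
print(classification_report(y_true=y_test, y_pred=y_pred))

With the arguments swapped, the support column counts the classes in your predictions rather than in the true labels, which is why one class appeared to have no samples.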

Related

Different results using OneVsRestClassifier(KNeighborsClassifier(n_neighbors=2)) compared to KNeighborsClassifier(n_neighbors=2)

I'm implementing a multi-class classifier and I'm getting different results when wrapping KNN in a multi-class classifier.
I'm not sure why, as I understood KNN already handles multiclass problems?
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import confusion_matrix, balanced_accuracy_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neighbors import KNeighborsClassifier

y = rock_df['Sample_type']
X = rock_df[col_list]

def model_eval(model, X, y):
    """Fit the classifier on X and y with a 0.33 test hold-out, stratified by y,
    and print the cross-validation accuracy and standard deviation.

    Inputs:
    model: the ML model to be tested
    X: the cleaned and preprocessed data (normalized, NaNs dealt with)
    y: target labels for the input data X
    """
    # Split train/test
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, stratify=y)
    n = X_test.size
    # Fit model
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Scoring
    confusion_matrix(y_test, y_pred)
    balanced_accuracy_score(y_test, y_pred)
    scores = cross_val_score(model, X, y, cv=3)
    mean = scores.mean()
    sd = scores.std()
    print("For {} : {:.1%} accuracy on cross validation, with a standard deviation of {:.1%}".format(model, mean, sd))
    # binomial confidence interval - 95% -- confirm difference with SD
    #interval = 1.96 * sqrt( (mean * (1 - mean)) / n )
    #print('Confidence Interval: {:.3%}'.format(interval))
    #return balanced_accuracy_score, confusion_matrix

model = OneVsRestClassifier(KNeighborsClassifier(n_neighbors=2))
model_eval(model, X, y)
model = KNeighborsClassifier(n_neighbors=2)
model_eval(model, X, y)
For the first model I get:
For OneVsRestClassifier(estimator=KNeighborsClassifier(n_neighbors=2)) : 78.6% accuracy on cross validation, with a standard deviation of 5.8%
And for the second:
For KNeighborsClassifier(n_neighbors=2) : 83.3% accuracy on cross validation, with a standard deviation of 8.9%
thanks
It is OK that you get different results. KNeighborsClassifier doesn't employ a one-vs-rest strategy; majority voting works natively with three or more classes, so there is no need for OvR in the original implementation. But trying OneVsRestClassifier can still be useful, and in general the decision boundaries will differ. Here I played with the Iris dataset to get decision boundaries using KNeighborsClassifier(n_neighbors=5) and OneVsRestClassifier(KNeighborsClassifier(n_neighbors=5)):
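The plots are not reproduced here; below is a minimal sketch of how such a comparison can be drawn (two Iris features so the boundary is 2D; the grid resolution and figure size are arbitrary choices):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neighbors import KNeighborsClassifier

# Use two features so the decision boundary can be drawn in 2D
X, y = load_iris(return_X_y=True)
X = X[:, :2]

models = {
    "KNeighborsClassifier": KNeighborsClassifier(n_neighbors=5),
    "OneVsRest(KNeighborsClassifier)": OneVsRestClassifier(KNeighborsClassifier(n_neighbors=5)),
}

# Evaluate each fitted model on a dense grid covering the feature space
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 300),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 300))

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, (name, model) in zip(axes, models.items()):
    model.fit(X, y)
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.3)               # colored regions = predicted class
    ax.scatter(X[:, 0], X[:, 1], c=y, s=20, edgecolor='k')
    ax.set_title(name)
plt.show()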

Using Python 3, how to get covariance/variance

I have a simple linear regression model and I need to compute the variance and the covariance. How do I calculate variance and covariance using linear regression?
Variance, in the context of Machine Learning, is a type of error that occurs due to a model's sensitivity to small fluctuations in the training set.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import numpy as np

# sklearn expects a 2D feature array, hence the reshape
X = np.array([2, 3, 4, 5]).reshape(-1, 1)
y = np.array([4, 3, 2, 9])

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Train the model using the training set, then predict on the test set
model = LinearRegression()
model.fit(X_train, y_train)
y_predict = model.predict(X_test)
Try this for the variance and covariance of the output vector:
y_variance = np.mean((y_predict - np.mean(y_predict))**2)
y_covariance = np.mean(y_predict - y_test)
Note: covariance here is taken as the mean change of the predictions with respect to their true values (y_test above).
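If you want the textbook definitions instead, NumPy provides them directly (a small sketch; np.cov treats each argument as one variable):

import numpy as np

y_var = np.var(y_predict)               # variance of the predictions
cov = np.cov(y_predict, y_test)[0, 1]   # Cov(predictions, true values)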

Why is my detection score high despite obvious misclassifications during prediction?

I am working on an intrusion classification problem using the NSL-KDD dataset. I used 10 features (out of 42) for training, selected by recursive feature elimination with a Random Forest classifier as the estimator and the Gini index as the splitting criterion. After training the classifier, I use the same classifier to predict the classes of the test data. My cross-validation scores (accuracy, precision, recall, F-score) using sklearn's cross_val_score are above 99% for all four metrics. But the confusion matrix shows otherwise, with high false-positive and false-negative counts. Clearly, these do not match the near-perfect scores. Where did I go wrong?
import pandas
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Train set contains X_train (dataframe of features) and Y_train (series of target labels)
# Test set contains X_test and Y_test
# Classifier
clf = RandomForestClassifier(n_estimators=10, criterion='gini')
# Training
clf.fit(X_train, Y_train)
# Testing
Y_pred = clf.predict(X_test)
pandas.crosstab(Y_test, Y_pred, rownames=['Actual'], colnames=['Predicted'])
# Scoring
accuracy = cross_val_score(clf, X_test, Y_test, cv=10, scoring='accuracy')
print("Accuracy: %0.5f (+/- %0.5f)" % (accuracy.mean(), accuracy.std() * 2))
precision = cross_val_score(clf, X_test, Y_test, cv=10, scoring='precision_weighted')
print("Precision: %0.5f (+/- %0.5f)" % (precision.mean(), precision.std() * 2))
recall = cross_val_score(clf, X_test, Y_test, cv=10, scoring='recall_weighted')
print("Recall: %0.5f (+/- %0.5f)" % (recall.mean(), recall.std() * 2))
f = cross_val_score(clf, X_test, Y_test, cv=10, scoring='f1_weighted')
print("F-Score: %0.5f (+/- %0.5f)" % (f.mean(), f.std() * 2))
I got accuracy, precision, recall, and F-score values of:
Accuracy 0.99825
Precision 0.99826
Recall 0.99825
F-Score 0.99825
However, the confusion matrix showed otherwise (rows: Actual, columns: Predicted):
 9670    41
 5113  2347
Am I training the whole thing wrong, or is it just a misclassification problem from poor feature selection?
Your predicted values are stored in Y_pred. Just check whether this plain hold-out score matches:
from sklearn.metrics import accuracy_score
accuracy_score(Y_test, Y_pred)
You are not comparing equivalent results! For the confusion matrix, you train on (X_train, Y_train) and test on (X_test, Y_test).
However, cross_val_score fits the estimator on k-1 folds of (X_test, Y_test) and tests it on the remaining fold of (X_test, Y_test), because cross_val_score does its own cross-validation (with 10 folds here) on whatever dataset you provide. Check out the cross_val_score documentation for more explanation.
So basically, the two evaluations do not fit and test the algorithm on the same data, which explains the inconsistency in the results.
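A minimal sketch of an apples-to-apples check, using the variables from the question (score the same hold-out predictions that produced the crosstab):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

clf.fit(X_train, Y_train)
Y_pred = clf.predict(X_test)

# These numbers describe exactly the predictions shown in the confusion matrix
print("Accuracy: %0.5f" % accuracy_score(Y_test, Y_pred))
print("Precision: %0.5f" % precision_score(Y_test, Y_pred, average='weighted'))
print("Recall: %0.5f" % recall_score(Y_test, Y_pred, average='weighted'))
print("F-Score: %0.5f" % f1_score(Y_test, Y_pred, average='weighted'))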

Scaling features by using polynomial features

I understand the concept of polynomial regression and the use of PolynomialFeatures in sklearn.
But to get a concrete hold on the concept of PolynomialFeatures expansion, I wanted to ask about a scenario.
Here, I want to consider polynomial features with KNN instead of regression.
This is how we use PolynomialFeatures for regression:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Expand the features, then split the transformed data
poly = PolynomialFeatures(degree=3)
X_T_poly = poly.fit_transform(X_T)
X_train, X_test, y_train, y_test = train_test_split(X_T_poly, y_T, random_state=0)
fin = LinearRegression().fit(X_train, y_train)
Now, what if I do this:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Same expansion and split, but with a KNN classifier downstream
poly = PolynomialFeatures(degree=3)
X_T_poly = poly.fit_transform(X_T)
X_train, X_test, y_train, y_test = train_test_split(X_T_poly, y_T, random_state=0)
knn = KNeighborsClassifier(n_neighbors=n)  # n: chosen number of neighbors
knn.fit(X_train, y_train)
In the former we are doing regression (I get it, OK) and in the latter we are using KNN.
Does it make sense to apply KNN here, and is it right in the first place?
How are the features being transformed in the case of KNN?
After going into .fit_transform() I found this; the function has nothing to do with coefficients and intercepts:
def fit_transform(self, X, y=None, **fit_params):
    """Fit to data, then transform it.

    Fits transformer to X and y with optional parameters fit_params
    and returns a transformed version of X.

    Parameters
    ----------
    X : numpy array of shape [n_samples, n_features]
        Training set.

    y : numpy array of shape [n_samples]
        Target values.

    Returns
    -------
    X_new : numpy array of shape [n_samples, n_features_new]
        Transformed array.
    """
    # non-optimized default implementation; override when a better
    # method is possible for a given clustering algorithm
    if y is None:
        # fit method of arity 1 (unsupervised transformation)
        return self.fit(X, **fit_params).transform(X)
    else:
        # fit method of arity 2 (supervised transformation)
        return self.fit(X, y, **fit_params).transform(X)
So is it OK to use PolynomialFeatures with KNN as above?
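For reference, here is a minimal sketch of what the expansion itself produces, independent of the downstream model (the two-feature input and degree=2 are illustrative):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2, 3]])                # one sample with features a=2, b=3
poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(X))          # [[1. 2. 3. 4. 6. 9.]] -> [1, a, b, a^2, a*b, b^2]

KNN fitted on the transformed data then simply measures neighbor distances in this expanded feature space.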

SVR predicts the average value but trains perfectly

I have been using SVR to predict the values of a time series. My dataset is split into train and test sets, and I use SVR with an RBF kernel to predict the test set. SVR models the training set almost perfectly, but it always predicts roughly the average value on the test set.
I have tried StandardScaler, normalization, and so on, but it always fails.
Here is my code:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X = np.array(x).reshape(-1, 1)
Y = np.array(y).reshape(-1, 1)

# Scale the targets
sc_y = StandardScaler()
Y = sc_y.fit_transform(Y)
Y = np.array(Y).ravel()

# Fit regression model (shuffle=False preserves the time order)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.4, random_state=0, shuffle=False)
svr_rbf = SVR(kernel='rbf', C=10, gamma=9.9999999999999995e-08, epsilon=0.1)
print(X_train.shape)
svr_rbf.fit(X_train, Y_train)
y_rbf = svr_rbf.predict(X_train)
y_rbf1 = svr_rbf.predict(X_test)
And here is my result (plot not reproduced): the prediction is at the end, where a constant value is shown.
Do you know what I should do to make the prediction better?
