What is the accuracy of a clustering algorithm? - scikit-learn

I have a set of points that I have clustered using a clustering algorithm (k-means in this case). I also know the ground-truth labels and I want to measure how accurate my clustering is. What I need is to find the actual accuracy. The problem, of course, is that the labels given by the clustering do not match the ordering of the original one.
Is there a way to measure this accuracy? The intuitive idea would be to compute the score of the confusion matrix of every combination of labels, and only keep the maximum. Is there a function that does this?
I have also evaluated my results using rand scores and adjusted rand score. How close are these two measures to actual accuracy?
Thanks!

First of all, what does The problem, of course, is that the labels given by the clustering do not match the ordering of the original one. mean?
If you know the ground truth labels then you can re-arrange them to match the order of the X matrix and in that way, the Kmeans labels will be in accordance with the true labels after prediction.
In this situation, I suggest the following.
If you have the ground truth labels and you want to see how accurate your model is, then you need metrics such as the Rand index or mutual information between the predicted and true labels. You can do that in a cross-validation scheme and see how the model behaves i.e. if it can predict correctly the classes/labels under a cross-validation scheme. The assessment of prediction goodness can be calculated using metrics like the Rand index.
In summary:
Define a Kmeans model and use cross-validation and in each iteration estimate the Rand index (or mutual information) between the assignments and the true labels. Repeat that for all iterations and finally, take the mean of the Rand index scores. If this score is high, then the model is good.
Full example:
from sklearn.cluster import KMeans
from sklearn.metrics.cluster import adjusted_rand_score
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut
import numpy as np
# some data
data = load_iris()
X = data.data
y = data.target # ground truth labels
loo = LeaveOneOut()
rand_index_scores = []
for train_index, test_index in loo.split(X): # LOOCV here
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
# the model
kmeans = KMeans(n_clusters=3, random_state=0)
kmeans.fit(X_train) # fit using training data
predicted_labels = kmeans.predict(X_test) # predict using test data
rand_index_scores.append(adjusted_rand_score(y_test, predicted_labels)) # calculate goodness of predicted labels
print(np.mean(rand_index_scores))

Since clustering is an unsupervised learning problem, you have specific metrics for it: https://scikit-learn.org/stable/modules/classes.html#clustering-metrics
You can refer to the discussion in the scikit-learn User Guide to have an idea of the differences between the different metrics for clustering: https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation
For instance, the adjusted Rand index will compare a pair of points and check that if the labels are the same in the ground-truth, it will be the same in the predictions. Unlike the accuracy, you cannot make strict label equality.

you can use sklearn.metrics.accuracy as documented in link mentioned below
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html
an example can be seen in link mentioned below
sklearn: calculating accuracy score of k-means on the test data set

Related

How to add more features in multi text classification?

I have a retail dataset with product_description, price, supplier, category as columns.
I used product_description as feature:
from sklearn import model_selection, preprocessing, naive_bayes
# split the dataset into training and validation datasets
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(df['product_description'], df['category'])
# label encode the target variable
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
valid_y = encoder.fit_transform(valid_y)
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
tfidf_vect.fit(df['product_description'])
xtrain_tfidf = tfidf_vect.transform(train_x)
xvalid_tfidf = tfidf_vect.transform(valid_x)
classifier = naive_bayes.MultinomialNB().fit(xtrain_tfidf, train_y)
# predict the labels on validation dataset
predictions = classifier.predict(xvalid_tfidf)
metrics.accuracy_score(predictions, valid_y) # ~20%, very low
Since the accuracy is very low, I want to add the supplier and price as features too. How can I incorporate this in the code?
I have tried other classifiers like LR, SVM, and Random Forrest, but they had (almost) the same outcome.
The TF-IDF vectorizer returns a matrix: one row per example with the scores. You can modify this matrix as you wish before feeding it into the classifier.
Prepare your additional features as a NumPy array of shape: number of examples × number of features.
Use np.concatenate with axis=1.
Fit the classifier as you did before.
It is usually a good idea to normalize real-valued features. Also, you can try different classifiers: Logistic Regression or SVM might do a better job for real-valued features than Naive Bayes.

How can this feature ranking problem be implemented with Support Vector Classification?

If I want the classifier to be SVM (using scikit-learn), how can I modify the 'clf' variable such that the svm classifier used for feature ranking results to high accuracy? What parameters/arguments do I need to add ? Which kernel type of SVC ('linear' or 'rbf' or 'sigmoid' or others) would you suggest for best accuracy?
The codes are referred form the following github link:
https://github.com/CynthiaKoopman/Network-Intrusion-Detection/blob/master/RandomForest_IDS.ipynb
I have 10 features which are ranked (with RecursiveFeatureElimination of scikit learn) from 1 to 10 which are from DoS attacks of the NSL-KDD dataset using RandomForestClassifier with 99% accuracy (using RFC as prediction model).
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
#from sklearn.svm import SVC
# Create a decision tree classifier. clf is the 'variable for classifier'
clf = RandomForestClassifier(n_jobs = 2)
# If classifier used is svm
#clf = SVC(kernel = "linear")
#rank all features, i.e continue the elimination until the last one
rfe = RFE(clf, n_features_to_select=1)
rfe.fit(X_newDoS, Y_DoS)
print ("DoS Features sorted by their rank:")
#print (sorted(zip(map(lambda x: round(x, 4), rfe.ranking_), newcolname_DoS)))
sorted_newcolname_DoS = sorted(zip(map(lambda x: round(x, 4), rfe.ranking_), newcolname_DoS))
sorted_newcolname_DoS
I expect more or less 99% similarity between the ranked features of the two classifiers, which I didnt observe.

Sklearn logistic regression - adjust cutoff point

I have a logistic regression model trying to predict one of two classes: A or B.
My model's accuracy when predicting A is ~85%.
Model's accuracy when predicting B is ~50%.
Prediction of B is not important however prediction of A is very important.
My goal is to maximize the accuracy when predicting A. Is there any way to adjust the default decision threshold when determining the class?
classifier = LogisticRegression(penalty = 'l2',solver = 'saga', multi_class = 'ovr')
classifier.fit(np.float64(X_train), np.float64(y_train))
Thanks!
RB
As mentioned in the comments, procedure of selecting threshold is done after training. You can find threshold that maximizes utility function of your choice, for example:
from sklearn import metrics
preds = classifier.predict_proba(test_data)
tpr, tpr, thresholds = metrics.roc_curve(test_y,preds[:,1])
print (thresholds)
accuracy_ls = []
for thres in thresholds:
y_pred = np.where(preds[:,1]>thres,1,0)
# Apply desired utility function to y_preds, for example accuracy.
accuracy_ls.append(metrics.accuracy_score(test_y, y_pred, normalize=True))
After that, choose threshold that maximizes chosen utility function. In your case choose threshold that maximizes 1 in y_pred.

scikit-learn LogisticRegressionCV: best coefficients

I am trying to understand how the best coefficients are calculated in a logistic regression cross-validation, where the "refit" parameter is True.
If I understand the docs correctly, the best coefficients are the result of first determining the best regularization parameter "C", i.e., the value of C that has the highest average score over all folds. Then, the best coefficients are simply the coefficients that were calculated on the fold that has the highest score for the best C. I assume that if the maximum score is achieved by several folds, the coefficients of these folds would be averaged to give the best coefficients (I didn't see anything on how this case is handled in the docs).
To test my understanding, I determined the best coefficients in two different ways:
directly from the coef_ attribute of the fitted model, and
from the coefs_paths attribute, which contains the path of the coefficients obtained during cross-validating across each fold and then across each C.
The results I get from 1. and 2. are similar but not identical, so I was hoping someone could point out what I am doing wrong here.
Thanks!
An example to demonstrate the issue:
from sklearn.datasets import load_breast_cancer
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# Set parameters
n_folds = 10
C_values = [0.001, 0.01, 0.05, 0.1, 1., 100.]
# Load and preprocess data
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
X_train_scaled = StandardScaler().fit_transform(X_train)
# Fit model
clf = LogisticRegressionCV(Cs=C_values, cv=n_folds, penalty='l1',
refit=True, scoring='roc_auc',
solver='liblinear', random_state=0,
fit_intercept=False)
clf.fit(X_train_scaled, y_train)
########################
# Get and plot coefficients using method 1
########################
coefs1 = clf.coef_
coefs1_series = pd.Series(coefs1.ravel(), index=cancer['feature_names'])
coefs1_series.sort_values().plot(kind="barh")
########################
# Get and plot coefficients using method 2
########################
# mean of scores of class "1"
scores = clf.scores_[1]
mean_scores = np.mean(scores, axis=0)
# Get index of the C that has the highest average score across all folds
best_C_idx = np.where(mean_scores==np.max(mean_scores))[0][0]
# Get index (here: indices) of the folds with highest scores for the
# best C
best_folds_idx = np.where(scores[:, best_C_idx]==np.max(scores[:, best_C_idx]))[0]
paths = clf.coefs_paths_[1] # has shape (n_folds, len(C_values), n_features)
coefs2 = np.squeeze(paths[best_folds_idx, best_C_idx, :])
coefs2 = np.mean(coefs2, axis=0)
coefs2_series = pd.Series(coefs2.ravel(), index=cancer['feature_names'])
coefs2_series.sort_values().plot(kind="barh")
I think this article answers your question: https://orvindemsy.medium.com/understanding-grid-search-randomized-cvs-refit-true-120d783a5e94.
The key point is the refit parameter of LogisticRegressionCV.
According to sklearn (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html)
refitbool, default=True
If set to True, the scores are averaged across all folds, and the coefs and the C that corresponds to the best score is taken, and a final refit is done using these parameters. Otherwise the coefs, intercepts and C that correspond to the best scores across folds are averaged.
Best.

cross_validation for time series in scikit learn machine learning

I wasn't able to find information I am looking for so I will post my question here.
I am just venturing into machine learning. I did my first multiple regression for a time series using scikit learn library. My code is as shown below
X = df[feature_cols]
y = df[['scheduled_amount']]
index= y.reset_index().drop('scheduled_amount', axis=1)
linreg = LinearRegression()
tscv = TimeSeriesSplit(max_train_size=None, n_splits=11)
li=[]
for train_index, test_index in tscv.split(X):
train = index.iloc[train_index]
train_start, train_end = train.iloc[0,0], train.iloc[-1,0]
test = index.iloc[test_index]
test_start, test_end = test.iloc[0,0], test.iloc[-1,0]
X_train, X_test = X[train_start:train_end], X[test_start:test_end]
y_train, y_test = y[train_start:train_end], y[test_start:test_end]
linreg.fit(X_train, y_train)
y_predict = linreg.predict(X_test)
print('RSS:' + str(linreg.score(X_test, y_test)))
y_test['predictec_amount'] = y_predict
y_test.plot()
Not that my data is a time series data and I want to keep the datetime index in my Dataframe when I'm fitting my model.
I am using the TimeSeriesSplit for cross-validation. I still don't really understand the cross validation thing.
First is there a need for a cross-validation in a time series dataset. Second should I use the last linear_coeff_ or should I get the average of all of them to use for my future prediction.
Yes there is a need for cross-validation in a timeseries dataset. Basically you need to ensure your model does not overfit your current test and is able to capture past seasonal changes so you can have some confidence in the model doing the same in the future. This method is also used to choose model hyperparameters (i.e. alpha in a Ridge regression).
In order to make future predictions, you should refit your regressor with the whole data and the best hyperparameters or, as #Marcus V. mentioned in the coments, maybe is best to train it only with the most recent data.

Resources