sklearn KNearestNeighbors with Multilabels

I have a dataset with features and their labels.
It looks like this:
X1, X2, X3, X4, X5 .. Xn  L1, L2, L3
Y1, Y2, Y3, Y4, Y5 .. Yn  L5, L2
..
I want to train a KNeighborsClassifier on this dataset, but it seems like sklearn does not take multilabel targets directly. I have been trying this:
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(Y)
# parameters tried: n_neighbors = [5, 15], weights = 'uniform', 'distance'
bagging = BaggingClassifier(KNeighborsClassifier(n_neighbors=5, weights='uniform'),
                            max_samples=0.6, max_features=0.7, verbose=1, oob_score=True)
scores = cross_val_score(bagging, X, Y, verbose=1, cv=3, n_jobs=3, scoring='f1_macro')
This raises ValueError: bad input shape.
Is there a way to run a multilabel classifier in sklearn?

According to the sklearn documentation, the classifiers that support multioutput-multiclass classification tasks are:
Decision Trees, Random Forests, Nearest Neighbors

Since you have a binary matrix for your labels, you can use OneVsRestClassifier to make your BaggingClassifier handle multilabel predictions. Code should now look like:
bagging = BaggingClassifier(KNeighborsClassifier(n_neighbors=5, weights='uniform'), max_samples=0.6, max_features=0.7, verbose=1, oob_score=True)
clf = OneVsRestClassifier(bagging)
scores = cross_val_score(clf, X, Y, verbose=1, cv=3, n_jobs=3, scoring='f1_macro')
You can use the OneVsRestClassifier with any of the sklearn models to do multilabel classification.
Here's an explanation:
http://scikit-learn.org/stable/modules/multiclass.html#one-vs-the-rest
And here are the docs:
http://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html
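For completeness, here is a minimal, self-contained sketch of this approach. It substitutes make_multilabel_classification for the asker's data (an assumption, since the real dataset isn't shown), so Y is already a binary indicator matrix and the MultiLabelBinarizer step is unnecessary:
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the asker's data: X is (300, 20) and Y is a
# (300, 5) binary indicator matrix, so no MultiLabelBinarizer is needed here.
X, Y = make_multilabel_classification(n_samples=300, n_features=20,
                                      n_classes=5, random_state=0)

bagging = BaggingClassifier(KNeighborsClassifier(n_neighbors=5, weights='uniform'),
                            max_samples=0.6, max_features=0.7)
clf = OneVsRestClassifier(bagging)  # one bagged KNN per label column

scores = cross_val_score(clf, X, Y, cv=3, scoring='f1_macro')
print(scores.mean())
OneVsRestClassifier fits one copy of the wrapped estimator per label column, which is what makes the binary indicator matrix acceptable as Y.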

For anybody who finds this while looking for multi-label KNN (ML-kNN) options, I would recommend skmultilearn, which is built on top of sklearn, so it is easy to use if you are familiar with the latter package.
Documentation here. This example is from the documentation:
from skmultilearn.adapt import MLkNN
classifier = MLkNN(k=3)
# train
classifier.fit(X_train, y_train)
# predict
predictions = classifier.predict(X_test)
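A slightly fuller sketch of how that snippet fits together, using synthetic sklearn data as an assumption in place of the asker's X_train/y_train. One caveat: skmultilearn is not very actively maintained, so it may require an older scikit-learn version to run:
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split
from skmultilearn.adapt import MLkNN

# Synthetic multilabel problem; y is a binary indicator matrix
X, y = make_multilabel_classification(n_samples=200, n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

classifier = MLkNN(k=3)
classifier.fit(X_train, y_train)

# MLkNN returns a scipy sparse indicator matrix; densify it to inspect
predictions = classifier.predict(X_test).toarray()
print(predictions[:5])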

Related

Different results using OneVsRestClassifier(KNeighborsClassifier(n_neighbors=2)) compared to KNeighborsClassifier(n_neighbors=2)

I'm implementing a multi-class classifier and I'm getting different results when wrapping KNN in a multi-class wrapper.
I'm unsure why, as I understood KNN already works for multiclass?
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import confusion_matrix, balanced_accuracy_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neighbors import KNeighborsClassifier

y = rock_df['Sample_type']
X = rock_df[col_list]

def model_eval(model, X, y):
    """Implements the classifier model on X and y with a 0.33 test hold-out,
    stratified by y, and prints cross-validated accuracy and standard deviation.

    Inputs:
    model: the ML model to be tested
    X: the cleaned and preprocessed data (normalized, and NaN dealt with)
    y: target labels for input data X
    """
    # Split train/test
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=42, stratify=y)
    n = X_test.size
    # Fit model
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Scoring (results of the next two calls are not stored)
    confusion_matrix(y_test, y_pred)
    balanced_accuracy_score(y_test, y_pred)
    scores = cross_val_score(model, X, y, cv=3)
    mean = scores.mean()
    sd = scores.std()
    print("For {} : {:.1%} accuracy on cross validation, with a standard deviation of {:.1%}".format(model, mean, sd))
    # binomial confidence interval - 95% -- confirm difference with SD
    # interval = 1.96 * sqrt((mean * (1 - mean)) / n)
    # print('Confidence Interval: {:.3%}'.format(interval))
    # return balanced_accuracy_score, confusion_matrix

model = OneVsRestClassifier(KNeighborsClassifier(n_neighbors=2))
model_eval(model, X, y)

model = KNeighborsClassifier(n_neighbors=2)
model_eval(model, X, y)
With the first model I get:
For OneVsRestClassifier(estimator=KNeighborsClassifier(n_neighbors=2)) : 78.6% accuracy on cross validation, with a standard deviation of 5.8%
and with the second:
For KNeighborsClassifier(n_neighbors=2) : 83.3% accuracy on cross validation, with a standard deviation of 8.9%
Thanks
It is expected that you get different results. KNeighborsClassifier does not employ a one-vs-rest strategy internally: its majority vote handles 3 or more classes natively, so there is no need for OvR in the original implementation. Wrapping it in OneVsRestClassifier replaces that single multiclass vote with one binary KNN per class, so the decision boundaries will generally differ, and so will the scores. Trying OneVsRestClassifier might still be useful, though. Here I played with the Iris dataset to get decision boundaries using KNeighborsClassifier(n_neighbors=5) and OneVsRestClassifier(KNeighborsClassifier(n_neighbors=5)).
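The decision-boundary figure from the original answer is not reproduced here, but a minimal sketch (assuming the standard Iris dataset) lets you confirm for yourself that the two models disagree on some points:
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
ovr = OneVsRestClassifier(KNeighborsClassifier(n_neighbors=5)).fit(X, y)

# Count the training points on which the two decision rules disagree
disagreements = (knn.predict(X) != ovr.predict(X)).sum()
print(disagreements, "of", len(X), "predictions differ")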

visualize predict_proba for multiclass classification

With model.predict_proba(X) I just get a big array with lots of numbers.
I am looking for a way to visualize the prediction probabilities for all classes (13 in my case). I use a RandomForestClassifier.
Any recommendation?
Heatmaps are a nice way to visualise a 2D matrix. Of course, if the number of records in your X is large, it is hard to visualize everything in a single go, so you would probably have to sample records. Here I'm showing the visuals for the first 10 records, labelling the predicted classes where the predicted probability is greater than 0.1.
Check out this example:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np

X, y = make_classification(n_samples=10000, n_features=40,
                           n_informative=30, n_classes=13,
                           n_redundant=0, n_clusters_per_class=1,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
forest = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_train, y_train)
pred = forest.predict_proba(X_test)[:10]

fig, ax = plt.subplots(figsize=(20, 8))
im = ax.imshow(pred, cmap='Blues')
ax.grid(axis='y')
ax.set_xticklabels([])
ax.set_yticks(np.arange(pred.shape[0]))
plt.ylabel('Records', fontsize='xx-large')
plt.xlabel('Classes', fontsize='xx-large')
fig.colorbar(im, ax=ax, fraction=0.046, pad=0.04)

# Annotate cells where the predicted probability exceeds 0.1
for i in range(pred.shape[0]):
    for j in range(13):
        if pred[i, j] > .1:
            ax.text(j, i, j, ha="center", va="center", color="w", fontsize=30)
If your input space is 2D, or if you use some dimensionality reduction technique to embed it in 2D, you could plot the multiclass decision surface:
import numpy as np
import matplotlib.pyplot as plt
import sklearn.datasets
import sklearn.ensemble

# generate toy data
X, y = sklearn.datasets.make_blobs(n_samples=1000, centers=13)
# fit classifier
clf = sklearn.ensemble.RandomForestClassifier().fit(X, y)
# create decision surface
xx, yy = np.meshgrid(np.linspace(-13, 12, 100),
                     np.linspace(-13, 12, 100))
Z = clf.predict(np.array([xx.ravel(), yy.ravel()]).T)
Z = Z.reshape(xx.shape)

fig, ax = plt.subplots(1, 1, figsize=(8, 8))
ax.scatter(X[:, 0], X[:, 1], c=y, cmap='Paired')
ax.contourf(xx, yy, Z, cmap='Paired', alpha=0.5)
Note this only shades per label (predict, not predict_proba), but you may be able to extend it to shade differently based on the probability.
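Building on that suggestion, one hedged way to fold predict_proba in is to overlay the winning class's probability as a confidence shading. This sketch reuses clf, xx, yy, Z, X and y from the block above:
# Probability of the most likely class at each grid point
proba = clf.predict_proba(np.array([xx.ravel(), yy.ravel()]).T)
confidence = proba.max(axis=1).reshape(xx.shape)

fig, ax = plt.subplots(1, 1, figsize=(8, 8))
ax.contourf(xx, yy, Z, cmap='Paired', alpha=0.5)           # class regions
ax.contourf(xx, yy, confidence, cmap='Greys', alpha=0.3)   # darker = more confident
ax.scatter(X[:, 0], X[:, 1], c=y, cmap='Paired', edgecolors='k', linewidths=0.2)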

Running train-test split and obtaining model accuracies for different datasets

I want to run train_test_split from sklearn package, using the same target variable y, but three different dataframes of independent variables. Then, I want to fit and predict using a Random Forest Classifier and get the accuracy. The goal here is to get accuracies for the three different dataframes so that I can compare them and select my variables accordingly.
I have the following so far, which is not working.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier as RandomForest
from sklearn import metrics

df = [X1, X2, X3]  # 3 different independent variable (features) DataFrames.
rf_accuracy = []
for index, z in enumerate(df):
    train_X, test_X, train_y, test_y = train_test_split(z, y, train_size=0.5, test_size=0.5, random_state=2)
    rf = RandomForest(random_state=99)
    rf.fit(train_X, train_y.ravel())
    pred_y = rf.predict(test_X)
    rf_accuracy = rf_accuracy.append(metrics.accuracy_score(test_y, pred_y))
print(rf_accuracy)
When I print the rf_accuracy, I should get a list with three accuracies from using three different feature spaces X1, X2, X3, respectively.
For example, rf_accuracy will output [0.9765, 0.9645, 0.9212]
I guess that your data look like this:
assert df.shape == (n_samples, 3) # each column for a variable/features
assert y.shape == (n_samples, )
and you are trying to train three RF clfs on the three different variables/features respectively.
Now, you can try this:
rf_accuracy = []
for _, z in df.items():  # iterate over the columns; iteritems() was removed in pandas 2.0
    train_X, test_X, train_y, test_y = train_test_split(
        z.values.reshape(-1, 1), y, train_size=0.5, test_size=0.5, random_state=2)
    rf = RandomForest(random_state=99)
    rf.fit(train_X, train_y)
    pred_y = rf.predict(test_X)
    rf_accuracy.append(metrics.accuracy_score(test_y, pred_y))  # append mutates in place
print(rf_accuracy)
I verified that this works on the iris dataset.
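Note that the question's original snippet iterates over a Python list of three DataFrames, and its immediate bug is rf_accuracy = rf_accuracy.append(...): list.append returns None, so rf_accuracy is wiped out on the first iteration. A minimal corrected version of the asker's own loop (assuming X1, X2, X3 and y are defined as in the question):
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier as RandomForest
from sklearn import metrics

dfs = [X1, X2, X3]  # the three feature DataFrames from the question
rf_accuracy = []
for z in dfs:
    train_X, test_X, train_y, test_y = train_test_split(
        z, y, train_size=0.5, test_size=0.5, random_state=2)
    rf = RandomForest(random_state=99)
    rf.fit(train_X, train_y)
    pred_y = rf.predict(test_X)
    rf_accuracy.append(metrics.accuracy_score(test_y, pred_y))  # mutate in place, don't reassign
print(rf_accuracy)  # one accuracy per feature set, e.g. [0.9765, 0.9645, 0.9212]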

Different result roc_auc_score and plot_roc_curve

I am training a RandomForestClassifier (sklearn) to predict credit card fraud. When I then test the model and check the ROC AUC score, I get different values from roc_auc_score and plot_roc_curve: roc_auc_score gives me around 0.89, while plot_roc_curve calculates the AUC as 0.96. Why is that?
The labels are all 0 and 1, and the predictions are also 0 or 1.
Code:
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train[target].values)
pred_test = clf.predict(X_test)
print(roc_auc_score(y_test, pred_test))
clf_disp = plot_roc_curve(clf, X_test, y_test)
plt.show()
Output of the code (the roc_auc_score is printed just above the graph).
You are feeding the predicted classes instead of the prediction probabilities to roc_auc_score.
From Documentation:
y_score : array-like of shape (n_samples,) or (n_samples, n_classes)
Target scores. In the binary and multilabel cases, these can be either probability estimates or non-thresholded decision values (as returned by decision_function on some classifiers).
Change your code to:
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train[target].values)
y_score = clf.predict_proba(X_test)  # note: predict_proba, not predict
print(roc_auc_score(y_test, y_score[:, 1]))
The ROC curve and roc_auc_score both take prediction probabilities as input, but as far as I can see from your code, you are providing the predicted labels. You need to fix that.
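A quick synthetic demonstration of the mismatch (assuming an imbalanced binary problem, since the actual fraud data isn't shown). Hard 0/1 predictions throw away the ranking information ROC analysis needs, so the score computed from them is lower:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Imbalanced binary problem as a stand-in for the fraud data
X, y = make_classification(n_samples=2000, weights=[0.9], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)

print(roc_auc_score(y_test, clf.predict(X_test)))              # thresholded labels: lower
print(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))  # probabilities: matches the plotted curve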

scikit-learn LogisticRegressionCV: best coefficients

I am trying to understand how the best coefficients are calculated in a logistic regression cross-validation, where the "refit" parameter is True.
If I understand the docs correctly, the best coefficients are the result of first determining the best regularization parameter "C", i.e., the value of C that has the highest average score over all folds. Then, the best coefficients are simply the coefficients that were calculated on the fold that has the highest score for the best C. I assume that if the maximum score is achieved by several folds, the coefficients of these folds would be averaged to give the best coefficients (I didn't see anything on how this case is handled in the docs).
To test my understanding, I determined the best coefficients in two different ways:
directly from the coef_ attribute of the fitted model, and
from the coefs_paths attribute, which contains the path of the coefficients obtained during cross-validating across each fold and then across each C.
The results I get from 1. and 2. are similar but not identical, so I was hoping someone could point out what I am doing wrong here.
Thanks!
An example to demonstrate the issue:
from sklearn.datasets import load_breast_cancer
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# Set parameters
n_folds = 10
C_values = [0.001, 0.01, 0.05, 0.1, 1., 100.]
# Load and preprocess data
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
X_train_scaled = StandardScaler().fit_transform(X_train)
# Fit model
clf = LogisticRegressionCV(Cs=C_values, cv=n_folds, penalty='l1',
                           refit=True, scoring='roc_auc',
                           solver='liblinear', random_state=0,
                           fit_intercept=False)
clf.fit(X_train_scaled, y_train)
########################
# Get and plot coefficients using method 1
########################
coefs1 = clf.coef_
coefs1_series = pd.Series(coefs1.ravel(), index=cancer['feature_names'])
coefs1_series.sort_values().plot(kind="barh")
########################
# Get and plot coefficients using method 2
########################
# mean of scores of class "1"
scores = clf.scores_[1]
mean_scores = np.mean(scores, axis=0)
# Get index of the C that has the highest average score across all folds
best_C_idx = np.where(mean_scores==np.max(mean_scores))[0][0]
# Get index (here: indices) of the folds with highest scores for the
# best C
best_folds_idx = np.where(scores[:, best_C_idx]==np.max(scores[:, best_C_idx]))[0]
paths = clf.coefs_paths_[1] # has shape (n_folds, len(C_values), n_features)
coefs2 = np.squeeze(paths[best_folds_idx, best_C_idx, :])
coefs2 = np.mean(coefs2, axis=0)
coefs2_series = pd.Series(coefs2.ravel(), index=cancer['feature_names'])
coefs2_series.sort_values().plot(kind="barh")
I think this article answers your question: https://orvindemsy.medium.com/understanding-grid-search-randomized-cvs-refit-true-120d783a5e94.
The key point is the refit parameter of LogisticRegressionCV.
According to sklearn (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html)
refit : bool, default=True
If set to True, the scores are averaged across all folds, and the coefs and the C that corresponds to the best score is taken, and a final refit is done using these parameters. Otherwise the coefs, intercepts and C that correspond to the best scores across folds are averaged.
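In other words, with refit=True the coef_ attribute comes from a final fit on the whole training set at the best C, not from averaging the per-fold coefficient paths, which is why method 2 above is close but not identical. A hedged sketch to check this, reusing clf, X_train_scaled and y_train from the question's example:
import numpy as np
from sklearn.linear_model import LogisticRegression

# Refit a plain LogisticRegression on the full training set at the best C;
# with refit=True this should reproduce clf.coef_ (up to solver tolerance)
refit_clf = LogisticRegression(C=clf.C_[0], penalty='l1', solver='liblinear',
                               random_state=0, fit_intercept=False)
refit_clf.fit(X_train_scaled, y_train)
print(np.allclose(refit_clf.coef_, clf.coef_))  # expected: True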
Best.
