Sklearn kmeans equivalent of elbow method - python-3.x

Let's say I'm examining up to 10 clusters, with scipy I usually generate the 'elbow' plot as follows:
from scipy import cluster
from matplotlib import pyplot
cluster_array = [cluster.vq.kmeans(my_matrix, i) for i in range(1,10)]
pyplot.plot([var for (cent,var) in cluster_array])
pyplot.show()
I have since become motivated to use sklearn for clustering; however, I'm not sure how to create the array needed to plot as in the scipy case. My best guess was:
from sklearn.cluster import KMeans
km = [KMeans(n_clusters=i) for i range(1,10)]
cluster_array = [km[i].fit(my_matrix)]
That unfortunately resulted in an invalid command error. What is the best sklearn way to go about this?
Thank you

You can use the inertia_ attribute of the KMeans class.
Assuming X is your dataset:
from sklearn.cluster import KMeans
from matplotlib import pyplot as plt
X = ...  # <your_data>
distortions = []
for k in range(2, 20):
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(X)
    # inertia_ is the sum of squared distances of samples to their closest center
    distortions.append(kmeans.inertia_)
fig = plt.figure(figsize=(15, 5))
plt.plot(range(2, 20), distortions)
plt.grid(True)
plt.title('Elbow curve')
plt.show()

You had some syntax problems in the code. They should be fixed now:
Ks = range(1, 10)
km = [KMeans(n_clusters=i) for i in Ks]
score = [km[i].fit(my_matrix).score(my_matrix) for i in range(len(km))]
The fit method just returns the fitted estimator itself (self). So in this line of the original code
cluster_array = [km[i].fit(my_matrix)]
cluster_array would end up holding the same objects as km.
You can use the score method to estimate how well each clustering fits the data. To see the score for each number of clusters, simply run plot(Ks, score).
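As a rough sketch of how the two answers relate: for KMeans, score(X) on the training data is the negative of the objective stored in inertia_, so either one can be used for the elbow plot (the curve is just flipped). The blob data below is a made-up stand-in for my_matrix, and n_init and random_state are set only to make the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for my_matrix (an assumption for illustration only).
my_matrix, _ = make_blobs(n_samples=300, centers=4, random_state=0)

Ks = range(1, 10)
inertias = []
scores = []
for k in Ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(my_matrix)
    inertias.append(km.inertia_)          # sum of squared distances to closest center
    scores.append(km.score(my_matrix))    # negative of the same objective

inertias = np.array(inertias)
scores = np.array(scores)
```

Plotting Ks against inertias gives the usual downward elbow curve; plotting Ks against scores gives its mirror image.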

You can also use the average Euclidean distance between each data point and its nearest cluster center to evaluate how many clusters to choose. Here is a code example.
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
iris = load_iris()
x = iris.data
res = list()
n_cluster = range(2,20)
for n in n_cluster:
    kmeans = KMeans(n_clusters=n)
    kmeans.fit(x)
    res.append(np.average(np.min(cdist(x, kmeans.cluster_centers_, 'euclidean'), axis=1)))
plt.plot(n_cluster, res)
plt.title('elbow curve')
plt.show()

Related

Annotating clustering from DBSCAN to original Pandas DataFrame

I have working code that is utilizing dbscan to find tight groups of sparse spatial data imported with pd.read_csv.
I am maintaining the original spatial data locations and would like to annotate the labels returned by dbscan for each data point to the original dataframe and then write a csv with the same information.
So the code below is doing exactly what I would expect it to at this point, I would just like to extend it to import the label for each row in the original dataframe.
import argparse
import string
import os, subprocess
import pathlib
import glob
import gzip
import re
import time
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from sklearn.cluster import DBSCAN
X = pd.read_csv(tmp_csv_name)
X = X.drop('Name', axis = 1)
X = X.drop('Type', axis = 1)
X = X.drop('SomeValue', axis = 1)
# only columns 'x' and 'y' now remain
db=DBSCAN(eps=EPS, min_samples=minSamples, metric='euclidean', algorithm='auto', leaf_size=30).fit(X)
labels = db.labels_
unique_labels = set(labels)
# maxX , maxY are manual inputs temporarily
while sizeX > 16 or sizeY > 16 :
    sizeX=sizeX*0.8 ; sizeY=sizeY*0.8
fig, ax = plt.subplots(figsize=(sizeX,sizeY))
plt.xlim(0,maxX)
plt.ylim(0,maxY)
plt.scatter(X['x'], X['y'], c=colors, marker="o", picker=True)
# hackX , hackY are manual inputs temporarily
# which represent the boundaries defined in the original dataset
poly = patches.Polygon(xy=list(zip(hackX,hackY)), fill=False)
ax.add_patch(poly)
plt.show()
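One possible way to attach the labels, sketched under assumptions: labels_ has one entry per input row in the same order, so it can be assigned as a new column as long as the original DataFrame is kept and only a view of the coordinate columns is passed to DBSCAN. The data, the eps and min_samples values, and the column name 'cluster' below are all made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

# Hypothetical stand-in for the CSV described in the question.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'Name': [f'p{i}' for i in range(60)],
    'x': np.concatenate([rng.normal(0, 0.1, 30), rng.normal(5, 0.1, 30)]),
    'y': np.concatenate([rng.normal(0, 0.1, 30), rng.normal(5, 0.1, 30)]),
})

# Cluster on the coordinate columns only, leaving df itself untouched.
coords = df[['x', 'y']]
db = DBSCAN(eps=0.5, min_samples=5).fit(coords)

# labels_ is aligned with the input rows; -1 marks noise points.
df['cluster'] = db.labels_

# Without a path, to_csv returns the CSV text; pass a file name to write to disk.
csv_text = df.to_csv(index=False)
```

Because the other columns are never dropped from df, the written CSV keeps Name, Type and any other original fields next to the new label column.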

Why is sklearn RandomForestClassifier root node different from the most important feature?

How is feature importance calculated in RandomForestClassifier in scikit-learn?
Here's a reproducible code. I run the classifier once with criterion set to gini and once to entropy. For each of them, I print the feature importance and plot the tree.
In neither instance is the feature at the root node the same as the most important feature. Why is that?
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_graphviz
from IPython.display import Image, display
from subprocess import call
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_wine
from sklearn.datasets import load_iris
wines = load_wine()
iris = load_iris()
def create_and_fit(clf, model_name):
    print(clf)
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=10, n_informative=3, n_redundant=5, random_state=seed)
    # X, y = iris.data, iris.target
    # X, y = wines.data, wines.target
    # fit the model
    clf.fit(X, y)
    # get importance, sorted from most to least important
    importance = clf.feature_importances_
    indices = np.argsort(importance)[::-1]
    for f in range(X.shape[1]):
        print("feature {}: ({})".format(indices[f], importance[indices[f]]))
    filename = model_name + clf.criterion
    if model_name == 'forest_':
        print('forest')
        # a forest holds many trees; only the first one is exported
        export_graphviz(clf.estimators_[0], out_file=filename+'.dot')
    else:
        export_graphviz(clf, out_file=filename+'.dot')
    call(['dot', '-Tpng', filename+'.dot', '-o', filename+'.png', '-Gdpi=600'])
seed=0
models = [
RandomForestClassifier(criterion='gini',max_depth=5, random_state=seed),
RandomForestClassifier(criterion='entropy',max_depth=5, random_state=seed),
]
names =['forest_', 'forest_']
for name, model in zip(names, models):
    create_and_fit(model,name)
Here's the snippet to load the image:
Image(filename = 'forest_gini'+'.png')
and for the entropy
Image(filename = 'forest_entropy'+'.png')
This behaviour seems to only happen with ensembles not trees (I'm generalizing as I only tried on Random forest and Decision Tree).
Here's the snippet for decision trees
models = [
DecisionTreeClassifier(criterion='gini',max_depth=5, random_state=seed),
DecisionTreeClassifier(criterion='entropy',max_depth=5, random_state=seed)
]
names =['tree_', 'tree_']
for name, model in zip(names, models):
    create_and_fit(model,name)
Here's the snippet to load the image:
Image(filename = 'tree_gini'+'.png')
and for the entropy
Image(filename = 'tree_entropy'+'.png')
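I think I found the answer, which is related to the max_features parameter of RandomForestClassifier. Here's the scikit-learn documentation:
max_features : {“sqrt”, “log2”, None}, int or float, default=”sqrt”
The number of features to consider when looking for the best split:
If int, then consider max_features features at each split.
If float, then max_features is a fraction and round(max_features * n_features) features are considered at each split.
If “auto”, then max_features=sqrt(n_features).
If “sqrt”, then max_features=sqrt(n_features).
If “log2”, then max_features=log2(n_features).
If None, then max_features=n_features.
With the default max_features="sqrt", each split only considers a random subset of the features, so the most important feature may not even be a candidate at the root of a given tree. In addition, the plotted tree is just one estimator (estimators_[0]), whereas feature_importances_ is averaged over all trees in the forest.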
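To check this explanation on the question's dataset, one could count which feature ends up at the root of every tree in the forest. The sketch below assumes the same make_classification settings as the question; root_features is a hypothetical helper, and n_estimators=50 is chosen only to keep the run short. No particular counts are claimed here; with max_features="sqrt" one would typically expect several different root features.

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_informative=3,
                           n_redundant=5, random_state=0)

def root_features(forest):
    """Return the index of the feature tested at the root of every tree."""
    # tree_.feature[0] is the feature used at node 0, which is the root.
    return [int(est.tree_.feature[0]) for est in forest.estimators_]

# Default behaviour: a random subset of features is tried at each split.
rf_sqrt = RandomForestClassifier(n_estimators=50, max_depth=5,
                                 max_features='sqrt', random_state=0).fit(X, y)

# All features are candidates at every split; bootstrap resampling is kept.
rf_all = RandomForestClassifier(n_estimators=50, max_depth=5,
                                max_features=None, random_state=0).fit(X, y)

roots_sqrt = Counter(root_features(rf_sqrt))
roots_all = Counter(root_features(rf_all))
print('root features with max_features="sqrt":', dict(roots_sqrt))
print('root features with max_features=None: ', dict(roots_all))
print('most important feature overall:', int(rf_sqrt.feature_importances_.argmax()))
```

Comparing the two counters against the most important feature shows how much of the mismatch comes from feature subsampling and how much from looking at a single tree.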

Python 3 and Sklearn: Difficulty to use a NOT-sklearn model as a sklearn model

The code below works. It is just a routine that runs a cross-validation scheme using a linear model predefined in sklearn, and I have no problem with it. My problem is this: if I replace model=linear_model.LinearRegression() with model=RBF('multiquadric') (please see lines 14 and 15 in __main__), it no longer works. So my problem is actually in the class RBF, where I try to mimic a sklearn model.
If I replace the code described above, I get the following error:
FitFailedWarning)
/home/daniel/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_validation.py:536: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details:
ValueError: All arrays must be equal length.
FitFailedWarning)
1) Should I define a score function in the Class RBF?
2) How do I do that? I am lost. Since I inherit from BaseEstimator and RegressorMixin, I expected this to be handled internally.
3) Is there something else missing?
from sklearn import datasets
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn import model_selection
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from scipy.interpolate import Rbf
np.random.seed(0)
from sklearn.base import BaseEstimator, RegressorMixin
class RBF(BaseEstimator, RegressorMixin):
    def __init__(self,function):
        self.function=function
    def fit(self,x,y):
        self.rbf = Rbf(x, y,function=self.function)
    def predict(self,x):
        return self.rbf(x)
if __name__ == "__main__":
    # Load Data
    targetName='HousePrice'
    data=datasets.load_boston()
    featuresNames=list(data.feature_names)
    featuresData=data.data
    targetData = data.target
    df=pd.DataFrame(featuresData,columns=featuresNames)
    df[targetName]=targetData
    independent_variable_list=featuresNames
    dependent_variable=targetName
    X=df[independent_variable_list].values
    y=np.squeeze(df[[dependent_variable]].values)
    # Model Definition
    model=linear_model.LinearRegression()
    #model=RBF('multiquadric')
    # Cross validation routine
    number_splits=5
    score_list=['neg_mean_squared_error','neg_mean_absolute_error','r2']
    kfold = model_selection.KFold(n_splits=number_splits,shuffle=True, random_state=0)
    scalar = StandardScaler()
    pipeline = Pipeline([('transformer', scalar), ('estimator', model)])
    results = model_selection.cross_validate(pipeline, X, y, cv=kfold, scoring=score_list,return_train_score=True)
    for score in score_list:
        print(score+':')
        print('Train: '+'Mean',np.mean(results['train_'+score]),'Standard Error',np.std(results['train_'+score]))
        print('Test: '+'Mean',np.mean(results['test_'+score]),'Standard Error',np.std(results['test_'+score]))
Let's look at the documentation for scipy.interpolate.Rbf:
*args : arrays
x, y, z, …, d, where x, y, z, … are the coordinates of the nodes and d is the array of values at the nodes
So it takes a variable number of arguments, with the last argument being the values, which is y in your case. The kth argument holds the kth coordinate of all the data points (and likewise for the other arguments x, y, z, …).
Following the documentation, your code should be
from sklearn import datasets
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn import model_selection
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from scipy.interpolate import Rbf
np.random.seed(0)
from sklearn.base import BaseEstimator, RegressorMixin
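class RBF(BaseEstimator, RegressorMixin):
    def __init__(self,function):
        self.function=function
    def fit(self,X,y):
        # pass each feature column as a separate coordinate array, then the values
        self.rbf = Rbf(*X.T, y,function=self.function)
        # returning self follows the scikit-learn estimator convention
        return self
    def predict(self,X):
        return self.rbf(*X.T)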
# Load Data
data=datasets.load_boston()
X = data.data
y = data.target
number_splits=5
score_list=['neg_mean_squared_error','neg_mean_absolute_error','r2']
kfold = model_selection.KFold(n_splits=number_splits,shuffle=True, random_state=0)
scalar = StandardScaler()
model = RBF(function='multiquadric')
pipeline = Pipeline([('transformer', scalar), ('estimator', model)])
results = model_selection.cross_validate(pipeline, X, y, cv=kfold, scoring=score_list,return_train_score=True)
for score in score_list:
    print(score+':')
    print('Train: '+'Mean',np.mean(results['train_'+score]),'Standard Error',np.std(results['train_'+score]))
    print('Test: '+'Mean',np.mean(results['test_'+score]),'Standard Error',np.std(results['test_'+score]))
Output
neg_mean_squared_error:
Train: Mean -1.552450953914355e-20 Standard Error 7.932530906290208e-21
Test: Mean -23.007377210596463 Standard Error 4.254629143836107
neg_mean_absolute_error:
Train: Mean -9.398502208736061e-11 Standard Error 2.4673749061941226e-11
Test: Mean -3.1319779583728673 Standard Error 0.2162343985534446
r2:
Train: Mean 1.0 Standard Error 0.0
Test: Mean 0.7144217179633185 Standard Error 0.08526294242760363
Why *X.T: as we saw, each argument corresponds to one axis of all the data points, so we transpose X and then use the * operator to unpack it, passing each sub-array as a separate argument to the variable-length function.
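Note that the mode parameter of Rbf ('1-D' or 'N-D') controls how the values d are interpreted, not how the coordinates are passed. Newer SciPy versions also provide scipy.interpolate.RBFInterpolator, which accepts the coordinates directly as a single (n_samples, n_features) array.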
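The unpacking step can be seen in isolation with a small sketch. The data here is made up (20 samples, 3 features), and the point is only that Rbf(*X.T, y) is the same call as passing each column by hand.

```python
import numpy as np
from scipy.interpolate import Rbf

rng = np.random.default_rng(0)
X = rng.uniform(size=(20, 3))              # 20 samples, 3 features (made-up data)
y = X[:, 0] + 2.0 * X[:, 1] - X[:, 2]

# X.T has shape (3, 20); *X.T passes its 3 rows as separate coordinate arrays.
rbf_unpacked = Rbf(*X.T, y, function='multiquadric')
# The same call written out column by column.
rbf_explicit = Rbf(X[:, 0], X[:, 1], X[:, 2], y, function='multiquadric')

# Prediction needs the same unpacking of the new points.
X_new = rng.uniform(size=(5, 3))
pred_unpacked = rbf_unpacked(*X_new.T)
pred_explicit = rbf_explicit(X_new[:, 0], X_new[:, 1], X_new[:, 2])
```

The unpacked form has the advantage of working for any number of feature columns, which is what a scikit-learn style estimator needs.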

Unstable behavior of OneClassSVM by changing 'nu'

In the example above, I'm using my dataset to identify outliers. After making slight changes to the nu parameter, there is a huge difference in the number of anomalies identified.
Could this be just a particularity of the dataset? Or a bug in scikit-learn?
P.S. Unfortunately I cannot share the dataset.
If you decrease the value of the tol parameter of OneClassSVM, the result is better, although still not quite as expected for low values of nu.
import numpy as np
from sklearn.svm import OneClassSVM
import matplotlib.pyplot as plt
X = np.random.rand(100, 1)
nus = np.geomspace(0.0001, 0.5, num=100)
outlier_fraction = np.zeros(len(nus))
for i, nu in enumerate(nus):
    outlier_fraction[i] = (OneClassSVM(nu=nu, tol=1e-12).fit_predict(X) == -1).mean()
plt.plot(nus, outlier_fraction)
plt.xlabel('nu')
plt.ylabel('Outlier fraction')
plt.show()
With the default tol you obtain the following
NOTE: not an answer. Offering a MCVE.
I also recently came across this. I would like to understand the inflection point at the low values of nu:
import numpy as np
import pandas as pd
from sklearn.svm import OneClassSVM
X = np.random.rand(100, 1)
nu = np.geomspace(0.0001, 1, num=100)
df = pd.DataFrame(data={'nu': nu})
for i in range(len(df)):  # one fit per nu value
    df.loc[i, 'anom_count'] = (OneClassSVM(nu=df.loc[i, 'nu']).fit_predict(X) == -1).sum()
df.set_index('nu').plot();
df.set_index('nu').plot(xlim=(0, 0.2));
df.anom_count.min() # 3
df.anom_count.idxmin() # 62
df.loc[df.anom_count.idxmin(), 'nu'] # 0.031

How to loop through items in pandas col and run and plot a scikit model?

I have some interesting user data from races. I know when the respective athletes planned to finish a race and when they actually finished (along with some other information). The goal is to find out when the athletes come in late. I want to run a support vector machine for each athlete and plot the decision boundaries.
Here is what I do:
import numpy as np
import pandas as pd
from sklearn import svm
from mlxtend.plotting import plot_decision_regions
import matplotlib.pyplot as plt
# Create arbitrary dataset for example
df = pd.DataFrame({'User': np.random.random_integers(low=1, high=4, size=50),
'Planned_End': np.random.uniform(low=-5, high=5, size=50),
'Actual_End': np.random.uniform(low=-1, high=1, size=50),
'Late': np.random.random_integers(low=0, high=2, size=50)}
)
# Fit Support Vector Machine Classifier
X = df[['Planned_End', 'Actual_End']]
y = df['Late']
clf = svm.SVC(decision_function_shape='ovo')
for i, y in df['User']:
    clf.fit(X, y)
    ax = plt.subplot()
    fig = plot_decision_regions(X=X, y=y, clf=clf, legend=2)
    plt.title(lab)
    plt.show()
I get the following error: TypeError: 'numpy.int64' object is not iterable - that is, I somehow can't loop through the column.
I guess it comes down to the numpy data format? How can I solve that?
Try items() (named iteritems() in older pandas versions, which was removed in pandas 2.0):
for i, y in df['User'].items():
Your User Series contains numpy.int64 objects so you can only use:
for y in df['User']:
And you don't use i anywhere.
As for the rest of the code, this produces some solution, please edit accordingly:
import numpy as np
import pandas as pd
from sklearn import svm
from mlxtend.plotting import plot_decision_regions
import matplotlib.pyplot as plt
# Create arbitrary dataset for example
# np.random.randint excludes the upper bound, so high is one more than the largest value wanted
df = pd.DataFrame({'User': np.random.randint(low=1, high=5, size=50),
'Planned_End': np.random.uniform(low=-5, high=5, size=50),
'Actual_End': np.random.uniform(low=-1, high=1, size=50),
'Late': np.random.randint(low=0, high=3, size=50)}
)
# Fit Support Vector Machine Classifier
# to_numpy() replaces as_matrix(), which was removed in pandas 1.0
X = df[['Planned_End', 'Actual_End']].to_numpy()
y = df['Late']
clf = svm.SVC(decision_function_shape='ovo')
y = df['User'].values
clf.fit(X, y)
ax = plt.subplot()
fig = plot_decision_regions(X=X, y=y, clf=clf, legend=2)
plt.title('lab')
plt.show()
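If the goal is really one classifier per athlete, as the question describes, a groupby loop is one way to sketch it. This is an assumption about the intent rather than part of the answers above: the data is random, the sample size is raised to 200 so every user has enough rows, and groups with a single class are skipped because SVC cannot be fitted on them. Plotting is left out; plot_decision_regions from the question could be called inside the loop.

```python
import numpy as np
import pandas as pd
from sklearn import svm

# Random example data in the shape used in the question (values are made up).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'User': rng.integers(low=1, high=5, size=200),
    'Planned_End': rng.uniform(low=-5, high=5, size=200),
    'Actual_End': rng.uniform(low=-1, high=1, size=200),
    'Late': rng.integers(low=0, high=3, size=200),
})

classifiers = {}
# groupby yields (user id, sub-DataFrame with only that user's rows).
for user, group in df.groupby('User'):
    X = group[['Planned_End', 'Actual_End']].to_numpy()
    y = group['Late'].to_numpy()
    if len(np.unique(y)) < 2:
        continue  # SVC needs at least two classes to fit
    clf = svm.SVC(decision_function_shape='ovo')
    clf.fit(X, y)
    classifiers[user] = clf
```

Each entry of classifiers is then a model trained only on that athlete's rows, so the decision regions can be drawn one figure per user.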
