Get Cluster Centers when using HDBSCAN Clustering [duplicate] - python-3.x

I'm pretty new to clustering and am trying out HDBSCAN, but I'm having a hard time figuring out how to get the cluster centers. With KMeans they are stored on the fitted model (cluster_centers_).
How do I go about getting the cluster centers?
Here's my code:
#!/usr/bin/env python3
from sklearn.cluster import KMeans
from sklearn import metrics
import cv2
import numpy as np
import hdbscan
from pprint import pprint
# Read image into opencv
image = cv2.imread('4.jpg')
# Set color space
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
# reshape the image to be a list of pixels
pixels = image.reshape((image.shape[0] * image.shape[1], 3))
# Build the clusterer
cluster = hdbscan.RobustSingleLinkage(cut=0.125, k=7)
cluster.fit(pixels)
>>> pprint(vars(cluster))
{'_cluster_hierarchy_': <hdbscan.plots.SingleLinkageTree object at 0x110deda58>,
'_metric_kwargs': {},
'algorithm': 'best',
'alpha': 1.4142135623730951,
'core_dist_n_jobs': 4,
'cut': 0.125,
'gamma': 5,
'k': 7,
'labels_': array([ 0, 0, 0, ..., 360, 220, 172]),
'metric': 'euclidean'}
By contrast, this is what the KMeans output gives you:
{'cluster_centers': (array([ 64.93473757, 65.65262431, 72.00103591]),
array([ 77.55381605, 85.80626223, 102.29549902]),
array([ 105.66884532, 115.81917211, 131.55555556]),
array([ 189.20149254, 197.00497512, 205.43034826]),
array([ 148.0922619 , 156.5 , 168.33333333])),
'cluster_centers_': array([[ 105.66884532, 115.81917211, 131.55555556],
[ 64.93473757, 65.65262431, 72.00103591],
[ 148.0922619 , 156.5 , 168.33333333],
[ 189.20149254, 197.00497512, 205.43034826],
[ 77.55381605, 85.80626223, 102.29549902]]),
'copy_x': True,
'inertia_': 1023155.888923295,
'init': 'k-means++',
'labels_': array([1, 1, 1, ..., 1, 1, 1], dtype=int32),
'max_iter': 300,
'n_clusters': 5,
'n_init': 10,
'n_iter_': 8,
'n_jobs': 1,
'precompute_distances': 'auto',
'random_state': None,
'tol': 0.0001,
'verbose': 0}

Clusters in (H)DBSCAN do not have centers.
The clusters may be non-convex, and if you compute the average of all points (and your data are points - they don't need to be), that mean may lie outside the cluster.
Also note that DBSCAN produces noise points, which don't belong to any cluster and have no center at all.
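That said, if you need a single representative point per cluster for downstream use, you can compute one yourself from labels_. A minimal sketch (the min_cluster_size value is illustrative, and the caveat above still applies: a mean can fall outside a non-convex cluster):
import numpy as np
import hdbscan
clusterer = hdbscan.HDBSCAN(min_cluster_size=15)  # illustrative parameter
labels = clusterer.fit_predict(pixels)
# Mean of each cluster's members; label -1 marks noise and has no center.
centers = {label: pixels[labels == label].mean(axis=0)
           for label in np.unique(labels) if label != -1}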

Related

What does the ordering/index of cluster_centers_ represent in KMeans clustering SKlearn

I have implemented the following code
k_mean = KMeans(n_clusters=5,init=centroids,n_init=1,random_state=SEED).fit(X_input)
k_mean.cluster_centers_.shape
>>
(5, 50)
I have 5 clusters of the data.
How are the clusters ordered? Do the indices of the cluster centers correspond to the labels?
That is, does the cluster center at index 0 represent label 0?
In the docs you have a similar example:
>>> from sklearn.cluster import KMeans
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
... [10, 2], [10, 4], [10, 0]])
>>> kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
>>> kmeans.labels_
array([1, 1, 1, 0, 0, 0], dtype=int32)
>>> kmeans.predict([[0, 0], [12, 3]])
array([1, 0], dtype=int32)
>>> kmeans.cluster_centers_
array([[10., 2.],
[ 1., 2.]])
Yes, the indices are ordered: cluster_centers_[i] is the center of the cluster with label i. By the way, k_mean.cluster_centers_.shape only returns the shape of your array, not the values. So in your case you have 5 clusters, and the dimension of your features is 50.
To get the nearest point, you can have a look here.
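For completeness, here is a minimal sketch of finding the data point nearest each centroid, continuing the docs example above with scikit-learn's pairwise_distances_argmin_min:
from sklearn.metrics import pairwise_distances_argmin_min
# For each centroid, the index of the closest sample in X and its distance
closest_idx, distances = pairwise_distances_argmin_min(kmeans.cluster_centers_, X)
print(X[closest_idx])  # one representative sample per cluster, ordered by label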

export looped numpy histogram outputs to list without array information and formatting

I am running through a loop of data compiled from a number of lists. I will eventually create a histogram, but I want the binned outputs from the histogram function to be exported to a list. Currently, the data is exported as a list that looks like an array - I assume this comes from the original numpy output, but I can't seem to solve the issue. Ideally, I want the binned values for each sub-list without the array information and the bin-edge arrays - any pointers?
bins = [0, 1.06, 5.01, 10.01, 15]
sigmafreqdist = []
for i in alldata:
    freqdist = np.histogram(i, bins)
    sigmafreqdist.append(freqdist)
# print the list
print(sigmafreqdist)
The result I get is something like this:
[(array([ 6, 14, 2, 0], dtype=int64),
array([ 0. , 1.06, 5.01, 10.01, 15. ])),
(array([ 5, 14, 0, 0], dtype=int64),
array([ 0. , 1.06, 5.01, 10.01, 15. ])),
(array([31, 19, 2, 0], dtype=int64),
array([ 0. , 1.06, 5.01, 10.01, 15. ])),
(array([12, 43, 1, 0], dtype=int64),
array([ 0. , 1.06, 5.01, 10.01, 15. ])),
(array([30, 34, 1, 0], dtype=int64),
array([ 0. , 1.06, 5.01, 10.01, 15. ])),
(array([12, 13, 0, 0], dtype=int64),
array([ 0. , 1.06, 5.01, 10.01, 15. ])),
(array([12, 28, 1, 0], dtype=int64),
array([ 0. , 1.06, 5.01, 10.01, 15. ]))]
The first array in each pair is useful, but the rest isn't - and the dtype text and brackets are not required. I have tried np.delete and tolist() to no avail.
Any help would be much appreciated.
I am quite new - sorry if the code is inefficient!
Here's my first answer to a Stack Overflow post. Instead of using for loops, try the simpler list comprehension. I hope this helps you!
import numpy as np

# Random dataset: a list of 3 lists, each containing 20 random
# numbers between 0 and 15.
Data = [np.random.uniform(0, 15, 20) for i in range(3)]

# Values used to bin the data.
Bins = [0, 1.06, 5.01, 10.01, 15]

# Using np.histogram on each list.
HistogramData = [np.histogram(i, Bins) for i in Data]

# 'i[0]' selects the first element in the tuple output of the histogram
# function, i.e. the frequencies. The function 'list()' removes the 'dtype='.
BinnedData = [list(i[0]) for i in HistogramData]
print(BinnedData)

# Merging everything into a function (it prints, so it returns None)
def PrintHistogramResults(YourData, YourBins):
    HistogramData = [np.histogram(i, YourBins) for i in YourData]
    BinnedData = [list(i[0]) for i in HistogramData]
    print(BinnedData)

TestDefinition = PrintHistogramResults(Data, Bins)
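An equivalent conversion, reusing Data and Bins from above: np.histogram returns a (counts, bin_edges) tuple, and calling .tolist() on the counts array strips both the array wrapper and the dtype text.
BinnedData = [np.histogram(i, Bins)[0].tolist() for i in Data]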

How can I do KMeans clustering in python for 8 columns in a data-frame of 14 columns?

I am trying to do clustering on the data-frame given to me. It has 14 columns; how do I do the clustering on 8 of them?
Below is the code that I found and followed.
(The original post linked screenshots of the elbow-method plot and the cluster visualization.)
# K-Means Clustering

# importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# importing the customer Expenses/Invoices dataset with pandas
dataset = pd.read_csv('Expense_Invoice.csv')
X = dataset.iloc[:, [3, 2]].values

# Using the elbow method to find the optimal number of clusters
from sklearn.cluster import KMeans
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters K')
plt.ylabel('Average Within-Cluster distance to Centroid (WCSS)')
plt.show()

# Applying k-means to the dataset
kmeans = KMeans(n_clusters=3, init='k-means++', max_iter=300, n_init=10, random_state=0)
y_kmeans = kmeans.fit_predict(X)

# Visualizing the clusters
plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s=100, c='red', label='Careful (c1)')
plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s=100, c='green', label='Standard (c2)')
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s=100, c='blue', label='Target (c3)')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=250, c='yellow',
            label='Centroids')
plt.title('Clusters of customer Invoices & Expenses')
plt.xlabel('Total Invoices')
plt.ylabel('Total Expenses')
plt.legend()
plt.show()
This works perfectly, but only for two columns (variables); I want to do it for 8 columns, and I could not understand how.
With X=dataset.iloc[:, [3,2]].values you are specifically selecting the 4th and 3rd columns.
KMeans performs the clustering on all columns you selected.
Therefore you need to change X=dataset.iloc[: , [3,2]] to your needs. Eg to use the first 8 columns of your dataset: X=dataset.iloc[:, 0:8].values.
Take a look at pandas documentation for more options how to select data in dataframes: https://pandas.pydata.org/pandas-docs/stable/indexing.html
Keep in mind that you can't visualize your clusters in a 2D scatter plot as you have done before.
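If you still want a 2D picture of clusters fit on 8 columns, one common workaround (not from the original answer) is to project the data with PCA purely for plotting. A sketch, assuming the same dataset variable and that its first 8 columns are numeric:
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
X8 = dataset.iloc[:, 0:8].values             # cluster on 8 columns
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X8)
X2 = PCA(n_components=2).fit_transform(X8)   # 2D projection for display only
plt.scatter(X2[:, 0], X2[:, 1], c=labels, s=20)
plt.title('KMeans clusters (8 features, PCA projection)')
plt.show()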

Spark performance in single node

I'm trying to execute some of the sample Python/scikit-learn scripts on Spark on a single node (my desktop - a Mac with 8 GB). Here is my configuration
spark-env.sh file.
SPARK_MASTER_HOST='IP'
SPARK_WORKER_INSTANCES=3
SPARK_WORKER_CORES=2
I'm starting my slaves with
./sbin/start-slave.sh spark://IP
The Workers table at http://localhost:8080/ shows 3 workers running with 2 cores each.
My script file, which I took from https://databricks.com/blog/2016/02/08/auto-scaling-scikit-learn-with-apache-spark.html:
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from spark_sklearn import GridSearchCV  # this import was missing in the original snippet
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
digits = datasets.load_digits()
X, y = digits.data, digits.target
param_grid = {"max_depth": [3, None],
              "max_features": [1, 3, 10],
              "min_samples_split": [2, 3, 10],
              "min_samples_leaf": [1, 3, 10],
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"],
              "n_estimators": [10, 20, 40, 80]}
gs = GridSearchCV(sc, RandomForestClassifier(), param_grid=param_grid)
gs.fit(X, y)
Submitting the script
spark-submit --master spark://IP traingrid.py
However, I do not see any significant improvement in execution time.
Are there any other configurations required to make it more parallel, or should I add another node to improve it?
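One quick sanity check (not from the original thread) is to confirm the job is parallel at all before adding nodes. A minimal sketch to run under the same spark-submit:
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
print(sc.defaultParallelism)  # should reflect the 3 workers x 2 cores
# A trivially parallel job; if its tasks don't spread across the workers in the
# UI at localhost:8080, the bottleneck is configuration, not the grid search.
print(sc.parallelize(range(10**6), 6).map(lambda x: x * x).count())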

How can I fetch data from this array and use a condition to assign each point to the cluster with the minimum distance in Python?

I want to implement k-means in Python, but I just don't know how to get the minimum of the Euclidean distances.
I have calculated distances for 3 clusters;
this is my result array:
[array([4, 5], dtype=int64), 4.1231056256176606, 0,
array([4, 8], dtype=int64), 4.4721359549995796, 0,
array([14, 23], dtype=int64), 22.022715545545239, 0,
array([4, 5], dtype=int64), 1.0, 1,
array([4, 8], dtype=int64), 2.0, 1,
array([14, 23], dtype=int64), 19.723082923316021, 1]
here its my code:
for i in range(len(centroidrandom)):
    for j in range(3):
        jarak_ = euclidean_distances(data[j], centroidrandom[:][i])
        cluster.append(data[j])
        cluster.append(jarak_[0][0])
        cluster.append(i)
print(cluster)
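As an aside (not from the original thread): to turn those distances into cluster assignments, np.argmin over a distance matrix is the usual building block. A minimal sketch with hypothetical stand-in data:
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
data = np.array([[4, 5], [4, 8], [14, 23]])   # hypothetical points
centroids = np.array([[1, 2], [5, 7]])        # hypothetical centroids
dist = euclidean_distances(data, centroids)   # shape (n_points, n_centroids)
assignments = np.argmin(dist, axis=1)         # index of the nearest centroid
print(assignments)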
Here is some example code for kmeans clustering with three clusters, modified from the example given in the comment above:
from pylab import plot, show
from numpy import vstack, array
from numpy.random import rand
from scipy.cluster.vq import kmeans, vq

# data generation for three sets of data
data = vstack((rand(150, 2) + array([.5, .5]), rand(150, 2), rand(150, 2) + array([0, .5])))

# computing K-Means with K = 3 (3 clusters)
centroids, _ = kmeans(data, 3)

# assign each sample to a cluster
idx, _ = vq(data, centroids)
print(idx)

# some plotting using numpy's logical indexing
plot(data[idx == 0, 0], data[idx == 0, 1], 'ob',
     data[idx == 1, 0], data[idx == 1, 1], 'or',
     data[idx == 2, 0], data[idx == 2, 1], 'oy')
plot(centroids[:, 0], centroids[:, 1], 'sg', markersize=8)
show()
