The Dunn Index is a method of evaluating clustering. A higher value is better. It is calculated as the lowest intercluster distance (ie. the smallest distance between any two cluster centroids) divided by the highest intracluster distance (ie. the largest distance between any two points in any cluster).
I have a code snippet for calculating the Dunn Index:
def dunn_index(pf, cf):
"""
pf -- all data points
cf -- cluster centroids
"""
numerator = inf
for c in cf: # for each cluster
for t in cf: # for each cluster
if t is c: continue # if same cluster, ignore
numerator = min(numerator, distance(t, c)) # find distance between centroids
denominator = 0
for c in cf: # for each cluster
for p in pf: # for each point
if p.get_cluster() is not c: continue # if point not in cluster, ignore
for t in pf: # for each point
if t.get_cluster() is not c: continue # if point not in cluster, ignore
if t is p: continue # if same point, ignore
denominator = max(denominator, distance(t, p))
return numerator/denominator
The problem is this is exceptionally slow: for an example data set consisting of 5000 instances and 15 clusters, the function above needs to perform just over 375 million distance calculations at worst. Realistically it's much lower, but even a best case, where the data is ordered by cluster already, is around 25 million distance calculations. I want to shave time off of it, and I've already tried rectilinear distance vs. euclidean and it's not good.
How can I improve this algorithm?
TLDR: Importantly, the problem is set up in two-dimensions. For large dimensions, these techniques can be ineffective.
In 2D, we can compute the diameter (intracluster distance) of each cluster in O(n log n) time where n is the cluster size using convex hulls. Vectorization is used to speed up remaining operations. There are two possible asymptotic improvements mentioned at the end of the post, contributions welcome ;)
Setup and fake data:
import numpy as np
from scipy import spatial
from matplotlib import pyplot as plt
# set up fake data
np.random.seed(0)
n_centroids = 1000
centroids = np.random.rand(n_centroids, 2)
cluster_sizes = np.random.randint(1, 1000, size=n_centroids)
# labels from 1 to n_centroids inclusive
labels = np.repeat(np.arange(n_centroids), cluster_sizes) + 1
points = np.zeros((cluster_sizes.sum(), 2))
points[:,0] = np.repeat(centroids[:,0], cluster_sizes)
points[:,1] = np.repeat(centroids[:,1], cluster_sizes)
points += 0.05 * np.random.randn(cluster_sizes.sum(), 2)
Looks somewhat like this:
Next, we define a diameter function for computing the largest intracluster distance, based on this approach using the convex hull.
# compute the diameter based on convex hull
def diameter(pts):
# need at least 3 points to construct the convex hull
if pts.shape[0] <= 1:
return 0
if pts.shape[0] == 2:
return ((pts[0] - pts[1])**2).sum()
# two points which are fruthest apart will occur as vertices of the convex hull
hull = spatial.ConvexHull(pts)
candidates = pts[spatial.ConvexHull(pts).vertices]
return spatial.distance_matrix(candidates, candidates).max()
For the Dunn index calculation, I assume that we have already computed the points, the cluster labels and the cluster centroids.
If the number of clusters is large, the following solution based on Pandas may perform well:
import pandas as pd
def dunn_index_pandas(pts, labels, centroids):
# O(k n log(n)) with k clusters and n points; better performance with more even clusters
max_intracluster_dist = pd.DataFrame(pts).groupby(labels).agg(diameter_pandas)[0].max()
# O(k^2) with k clusters; can be reduced to O(k log(k))
# get pairwise distances between centroids
cluster_dmat = spatial.distance_matrix(centroids, centroids)
# fill diagonal with +inf: ignore zero distance to self in "min" computation
np.fill_diagonal(cluster_dmat, np.inf)
min_intercluster_dist = cluster_sizes.min()
return min_intercluster_dist / max_intracluster_dist
Otherwise, we can continue with a pure numpy solution.
def dunn_index(pts, labels, centroids):
# O(k n log(n)) with k clusters and n points; better performance with more even clusters
max_intracluster_dist = max(diameter(pts[labels==i]) for i in np.unique(labels))
# O(k^2) with k clusters; can be reduced to O(k log(k))
# get pairwise distances between centroids
cluster_dmat = spatial.distance_matrix(centroids, centroids)
# fill diagonal with +inf: ignore zero distance to self in "min" computation
np.fill_diagonal(cluster_dmat, np.inf)
min_intercluster_dist = cluster_sizes.min()
return min_intercluster_dist / max_intracluster_dist
%time dunn_index(points, labels, centroids)
# returned value 2.15
# in 2.2 seconds
%time dunn_index_pandas(points, labels, centroids)
# returned 2.15
# in 885 ms
For 1000 clusters with i.i.d. ~U[1,1000] cluster sizes this takes 2.2. seconds on my machine. This number drops to .8 seconds with the Pandas approach for this example (many small clusters).
There are two further optimization opportunities that are relevant when the number of clusters is large:
First, I am computing the minimal intercluster distance with a brute force O(k^2) approach where k is the number of clusters. This can be reduced to O(k log(k)), as discussed here.
Second, max(diameter(pts[labels==i]) for i in np.unique(labels)) requires k passes over an array of size n. With many clusters this can become the bottleneck (as in this example). This is somewhat mitigated with the pandas approach, but I expect that this can be optimized a lot further. For current parameters, roughly one third of compute time is spent outside of computing intercluser of intracluster distances.
It's not about optimizing algorithm itself, but I think one of the following advises can improve performance.
Using multiprocessing's pool of workers.
Extracting python code to c/cpp. Refer to official documentation.
Also there are Performance Tips on the https://www.python.org.
Related
I am trying to do a group-based generalization of leave one out cross validation (LOOCV) with a k-nearest neighbor classifier.
Standard LOOCV finds the k-nearest neighbors of each data point, excluding the point itself.
In my scenario, each data point is associated with a group variable (such as collection date) in addition to its features.
I would like to find the k-nearest neighbors of each data point, excluding all points in the same group (e.g., to get nearest neighbors collected on a different date).
This is straightforward to implement (below), but I am looking for a cleaner solution, ideally a few standard library calls, that is also efficient, i.e., exploiting sklearn's optimizations for NearestNeighbors search like "ball-tree" algorithm.
Setup example data and import libraries:
import pandas as pd
import numpy as np
from sklearn.neighbors import NearestNeighbors
np.random.seed(0)
group = pd.Series(['a', 'b', 'a', 'c', 'b', 'c'])
X = pd.DataFrame(np.random.rand(len(group), 2))
The k-nearest-neighbors needed for standard LOOCV can be computed easily with a single library call:
# k nearest neighbor of query point, excluding query point
k = 2
nbrs = NearestNeighbors(n_neighbors=k)
nbrs.fit(X)
nbrs.kneighbors(return_distance=False)
But finding the k-nearest neighbors with group-based exclusion appears to require iteration over groups, which will be slow when there are many groups:
# Inefficient for large number of groups: uses each group of data number_of_groups^2 times.
def kneighbors(X, group, k):
"""k nearest neighbors of each query point, excluding all points in query point's group."""
def kneighbors_group(g):
"""k nearest neighbors of all points in group g, excluding points in group g."""
Xg = X.loc[group==g, :]
Xnotg = X.loc[group!=g, :]
obj = NearestNeighbors(n_neighbors=min(k, Xnotg.shape[0])) # prevent number of neighbors from exceeding size of remaining data
obj.fit(Xnotg)
nbrs_notg_local = obj.kneighbors(X=Xg, return_distance=False) # locally indexed neighbors
nbrs_notg = [list(Xnotg.index[a]) for a in nbrs_notg_local] # globally indexed neighbors
return pd.Series(nbrs_notg, index=Xg.index)
# concatenate neighbors of each group and restore initial data order
return pd.concat([kneighbors_group(g) for g in group.unique()], axis=0).loc[X.index]
kneighbors(X, group, k)
Another option I am considering is using a custom distance metric where points in the same group would be defined to have distance infinity. But I think this will be too slow on larger datasets since it cannot exploit sklearn's efficient search algorithms.
My problems consists of the following: I am given two pairs angles (in spherical coordinates) which consists of two parts--an azimuth and a colatitude angle. If we extend both angles (thereby increasing their respective radii) infinitely to make a long line pointing in the direction given by the pair of angles, then my goal is to determine
if they intersect or extremely close to one another and
where exactly they intersect.
Currently, I have tried several methods:
The most obvious one is to iteratively compare each radii until there is either a match or a small enough distance between the two. (When I say compare each radii, I am referring to converting each spherical coordinate into Cartesian and then finding the euclidean distance between the two). However, this runtime is $O(n^{2})$, which is extremely slow if I am trying to scale this program
The second most obvious method is to use the optimization package to find this distance. Unfortunately, I cannot the optimization package iteratively and after one instance the optimization algorithm repeats the same answer, which is not useful.
The least obvious method is to directly calculate (using calculus) the exact radii from the angles. While this is fast method, it is not extremely accurate.
Note: while it might seem simple that the intersection is always at the zero-origin (0,0,0), this is not ALWAYS the case. Some points never intersect.
Code for Method (1)
def match1(azimuth_recon_1,colatitude_recon_1,azimuth_recon_2, colatitude_recon_2,centroid_1,centroid_2 ):
# Constants: tolerance factor and extremely large distance
tol = 3e-2
prevDist = 99999999
# Initialize a list of radii to loop through
# Checking iteravely for a solution
for r1 in list(np.arange(0,5,tol)):
for r2 in list(np.arange(0,5,tol)):
# Get the estimates
estimate_1 = np.array(spher2cart(r1,azimuth_recon_1,colatitude_recon_1)) + np.array(centroid_1)
estimate_2 = np.array(spher2cart(r2,azimuth_recon_2,colatitude_recon_2))+ np.array(centroid_2)
# Calculate the euclidean distance between them
dist = np.array(np.sqrt(np.einsum('i...,i...', (estimate_1 - estimate_2), (estimate_1 - estimate_2)))[:,np.newaxis])
# Compare the distance to this tolerance
if dist < tol:
if dist == 0:
return estimate_1, [], True
else:
return estimate_1, estimate_2, False
## If the distance is too big break out of the loop
if dist > prevDist:
prevDist = 9999999
break
prevDist = dist
return [], [], False
Code for Method (3)
def match2(azimuth_recon_1,colatitude_recon_1,azimuth_recon_2, colatitude_recon_2,centriod_1,centroid_2):
# Set a Tolerance factor
tol = 3e-2
def calculate_radius_2(azimuth_1,colatitude_1,azimuth_2,colatitude_2):
"""Return radius 2 using both pairs of angles (azimuth and colatitude). Equation is provided in the document"""
return 1/((1-(math.sin(azimuth_1)*math.sin(azimuth_2)*math.cos(colatitude_1-colatitude_2))
+math.cos(azimuth_1)*math.cos(azimuth_2))**2)
def calculate_radius_1(radius_2,azimuth_1,colatitude_1,azimuth_2,colatitude_2):
"""Returns radius 1 using both pairs of angles (azimuth and colatitude) and radius 2.
Equation provided in document"""
return (radius_2)*((math.sin(azimuth_1)*math.sin(azimuth_2)*math.cos(colatitude_1-colatitude_2))
+math.cos(azimuth_1)*math.cos(azimuth_2))
# Compute radius 2
radius_2 = calculate_radius_2(azimuth_recon_1,colatitude_recon_1,azimuth_recon_2,colatitude_recon_2)
#Compute radius 1
radius_1 = calculate_radius_1(radius_2,azimuth_recon_1,colatitude_recon_1,azimuth_recon_2,colatitude_recon_2)
# Get the estimates
estimate_1 = np.array(spher2cart(radius_1,azimuth_recon_1,colatitude_recon_1))+ np.array(centroid_1)
estimate_2 = np.array(spher2cart(radius_2,azimuth_recon_2,colatitude_recon_2))+ np.array(centroid_2)
# Calculate the euclidean distance between them
dist = np.array(np.sqrt(np.einsum('i...,i...', (estimate_1 - estimate_2), (estimate_1 - estimate_2)))[:,np.newaxis])
# Compare the distance to this tolerance
if dist < tol:
if dist == 0:
return estimate_1, [], True
else:
return estimate_1, estimate_2, False
else:
return [], [], False
My question is two-fold:
Is there a faster and more accurate way to find the radii for both
points?
If so, how do I do it?
EDIT: I am thinking about just creating two numpy arrays of the two radii and then comparing them via numpy boolean logic. However, I would still be comparing them iteratively. Is there is a faster way to perform this comparison?
Use a kd-tree for such situations. It will easily look up the minimal distance:
def match(azimuth_recon_1,colatitude_recon_1,azimuth_recon_2, colatitude_recon_2,centriod_1,centroid_2):
cartesian_1 = np.array([np.cos(azimuth_recon_1)*np.sin(colatitude_recon_1),np.sin(azimuth_recon_1)*np.sin(colatitude_recon_1),np.cos(colatitude_recon_1)]) #[np.newaxis,:]
cartesian_2 = np.array([np.cos(azimuth_recon_2)*np.sin(colatitude_recon_2),np.sin(azimuth_recon_2)*np.sin(colatitude_recon_2),np.cos(colatitude_recon_2)]) #[np.newaxis,:]
# Re-center them via adding the centroid
estimate_1 = r1*cartesian_1.T + np.array(centroid_1)[np.newaxis,:]
estimate_2 = r2*cartesian_2.T + np.array(centroid_2)[np.newaxis,:]
# Add them to the output list
n = estimate_1.shape[0]
outputs_list_1.append(estimate_1)
outputs_list_2.append(estimate_2)
# Reshape them so that they are in proper format
a = np.array(outputs_list_1).reshape(len(two_pair_mic_list)*n,3)
b = np.array(outputs_list_2).reshape(len(two_pair_mic_list)*n,3)
# Get the difference
c = a - b
# Put into a KDtree
tree = spatial.KDTree(c)
# Find the indices where the radius (distance between the points) is 3e-3 or less
indices = tree.query_ball_tree(3e-3)
This will output a list of the indices where the distance is 3e-3 or less. Now all you will have to do is use the list of indices with the estimate list to find the exact points. And there you have it, this will save you a lot of time and space!
Hello I'm on a project where I use 512 bits hash to create clusters. I'm using a custom metric bitwise hamming distance. But when I compare two hash with this function I obtain different distance results than using the NearestNeighbors.
Extending this to DBSCAN, using a eps=5, the cluster are created with some consistence, are being correctly clustered. But I try to check the distance between points from the same cluster I obtain distance enormous. Here is an example.
Example:
This a list of points from 2 clusters created by DBSCAN, and as you can see when using the function to calculate the distance gives number bigger than 30 but the NN gives results consistent with the eps=5.
from sklearn.neighbors import NearestNeighbors
hash_list_1 = [2711636196460699638441853508983975450613573844625556129377064665736210167114069990407028214648954985399518205946842968661290371575620508000646896480583712,
2711636396252606881895803338309150146134565539796776390549907030396205082681800682439355456735713892762967881436259141637319066484744271299497977370896760,
2711636396252606881918517135048330084905033589325484952567856239496981859330884970846906663264518266744879431357749780779892124020350824669153434630258784,
2711636396252797418317524490088561493800258861799581574018898781319096107333812163580085003775074676924785748114206505865657620572909617106316367216148512,
2711636196460318585955127494483972276879239064090689852809978361705086216958169367104329622890567955158961917611852516176654399246340379120409329566384160,
2711636396252606881918605860499354102197401318666579124151729671752374458560929422237113300739169875232495266727513833203360007861082211711747836501459040,
2685449071597530523833230885351500532369477539914318172159429043161052628696351016818586542171509728747070238075233795777242761861490021015910382103951968,
2685449271584547381638295372872027557715092296493457397817270817010861872186702795218797216694169625716749654321460983923962566367029011600112932108533792,
2685449071792640184514638654713547133316375160837810451952682241651988724244365461216285304336254942220323815140042850082680124299635209323646382761738272,
1847461275963134712629870519594779049860430827711272857522520377357653173694038204556169999876899727026751811340128091158803029889914422883922033917198368,
2711636396252606881901567718540735842607739343712295416931961674938924754114357607352250040524848697769853213132484145241805622979375000168935113673834592,
2711636396252606881901567718538101947732706353297593371282460773094032493492652041376662823635245997887100968237677157520342076957158825588198798784364576]
hash_list_2 = [1677246762479319235863065539858628614044010438213592493389244703420353559152336301659250128835190166728647823546464421558167523127086351613289685036466208,
1677246762479700308655934218233084077989052614799077817712715603728397519829375248244181345837838956827991047769168833176865438232999821278031784406056992,
1677246762479700314487411751526941880161990070273187005125752885368412445003620183982282356578440274746789782460884881633682918768578649732794162647826464,
1677246762479319238759152196394352786642547660315097253847095508872934279466872914748604884925141826161428241625796765725368284151706959618924400925900832,
1677246762479890853811162999308711253291696853123890392766127782305403145675433285374478727414572392743118524142664546768046227747593095585347134902140960,
1677246765601448867710925237522621090876591539557992237656925108430781026329148912958069241932475038282622646533152559554888274158032061637714105308528752,
1678883457783648388335228538833424204662395277995143067623864457726472665342252064374635323999849241968448535982901839797440478656657327613912450890367008,
1677246765601448864793634462245189770642489500950753120409198344054454862566173176691699195659218600616315451200851360013275424257209428603245704937128032,
1677246762479700314471974894075267160937462491405299015541470373650765401692659096424270522124311243007780041455682577230603077926878181390448030335795232,
1677246762479700317400446530288778920091525622772690226165317385340164047644547471081180880454458397836230795248631079659291423401151022423365062554976288,
1677246762479700317400446530288758590086745084806873060513679821541689120894219120403259478342385343805541797540566045409406476458247878183422733877936160,
2516871453405060707064684111867902766968378200849671168835363528433280949578746081906100803610196553501646503982070255639855643685380535999494563083255776,
1677246762479319230037086118512223039643232176451879100417048497454912234466993748113993020733268935613563596294183318283010061477487433484794582123053088,
1677246762479319235834673272207747972667132521699112379991979781620810490520617303678451683578338921267417975279632387450778387555221361833006151849902112,
1677246762479700305748490595643272813492272250002832996415474372704463760357437926852625171223210803220593114114602433734175731538424778624130491225112608]
def custom_metric(x, y):
return bin(int(x[0]) ^ int(y[0])).count('1')
objective_hash = hash_list_1[0]
complete_list = hash_list_1 + hash_list_2
distance = [custom_metric([objective_hash], [hash_point]) for hash_point in complete_list]
print("Function iteration distance:")
print(distance)
neighbors_model = NearestNeighbors(radius=100, algorithm='ball_tree',
leaf_size=2,
metric=custom_metric,
metric_params=None,
n_jobs=4)
X = [[x] for x in complete_list]
neighbors_model.fit(X)
distance, neighborhoods = neighbors_model.radius_neighbors(objective_hash, 100, return_distance=True)
print("Nearest Neighbors distance:")
print(distance)
print("Nearest Neighbors index:")
print(neighborhoods)
The problem:
Numpy can't handle numbers so big and converts them to float losing a lot of precision.
The solution:
Precompute with your custom metric all the distances and feed them to the DBSCAN algorithm.
I have a histogram of sorted random numbers and a Gaussian overlay. The histogram represents observed values per bin (applying this base case to a much larger dataset) and the Gaussian is an attempt to fit the data. Clearly, this Gaussian does not represent the best fit to the histogram. The code below is the formula for a Gaussian.
normc, mu, sigma = 30.845, 50.5, 7 # normalization constant, avg, stdev
gauss = lambda x: normc * exp( (-1) * (x - mu)**2 / ( 2 * (sigma **2) ) )
I calculated the expectation values per bin (area under the curve) and calculated the number of observed values per bin. There are several methods to find the 'best' fit. I am concerned with the best fit possible by minimizing Chi-Squared. In this formula for Chi-Squared, the expectation value is the area under the curve per bin and the observed value is the number of occurrences of sorted data values per bin. So I want to fluctuate normc, mu, and sigma near their given values to find the right combination of normc, mu, and sigma that produce the smallest Chi-Square, as these will be the parameters I can plug into the code above to overlay the best fit Gaussian on my histogram. I am trying to use the scipy module to minimize my Chi-Square as done in this example. Since I need to fluctuate parameters, I will use the function gauss (defined above) to plot the Gaussian overlay, and will define a new function to find the minimum Chi-Squared.
def gaussmin(var,data):
# var[0] = normc
# var[1] = mu
# var[2] = sigma
# data is the sorted random numbers, represents unbinned observed values
for index in range(len(data)):
return var[0] * exp( (-1) * (data[index] - var[1])**2 / ( 2 * (var[2] **2) ) )
# I realize this will return a new value for each index of data, any guidelines to fix?
After this, I am stuck. How can I fluctuate the parameters to find the normc, mu, sigma that produced the best fit? My last attempt at a solution is below:
var = [normc, mu, sigma]
result = opt.minimize(chi2, [normc,mu,sigma])
# chi2 is the chisquare value obtained via scipy
# chisquare input (a,b)
# where a is number of occurences per bin, b is expected value per bin
# b is dependent upon normc, mu, sigma
print(result)
# data is a list, can I keep it as a constant and only fluctuate parameters in var?
There are plenty of examples online for scalar functions but I cannot find any for variable functions.
PS - I can post my full code so far but it's bit lengthy. If you would like to see it, just ask and I can post it here or provide a googledrive link.
A Gaussian distribution is completely characterized by its mean and variance (or std deviation). Under the hypothesis that your data are normally distributed, the best fit will be obtained by using x-bar as the mean and s-squared as the variance. But before doing so, I'd check whether normality is plausible using, e.g., a q-q plot.
I have a set of categorical variables to be clustered and so I am using k modes taken from a github package. I want to get the distance of each observation (point) to the centroid of the cluster it belongs to.
This is what I have implemented so far:
kmodes_cao = kmodes.KModes(n_clusters=6, init='Cao', verbose=1)
kmodes_cao.fit_predict(data)
# Print cluster centroids of the trained model.
print('k-modes (Cao) centroids:')
print(kmodes_cao.cluster_centroids_)
# Print training statistics
print('Final training cost: {}'.format(kmodes_cao.cost_))
print('Training iterations: {}'.format(kmodes_cao.n_iter_))
I cannot use the Eucledean distance since the variables are categorical. What is the ideal way to calculate the distance of each point to its cluster centroid?
Example if you have 2 variables V1 which can take A or B and V2 can take C or D
If your centroid is V1=A and V2=D
For each variable i, count when Vi != Ci(centroid i)
if you have an instance V1=A and V2=C then the distance from the centroid is 1
it is a binary distance
hop that will help
You can use the method matching_dissim() from kmodes library.
To compare to rows in your dataset, one can be you centroid and other anyone. First you must install the library Panda then import the method with this line:
from kmodes.util.dissim import matching_dissim
https://github.com/nicodv/kmodes/issues/39