How can I get a representative point of a GMM cluster? - scikit-learn

I have clustered my data (75000, 3) using the sklearn Gaussian mixture model (GMM) algorithm and obtained 4 clusters. Each point of my data represents a molecular structure. Now I would like to get the most representative molecular structure of each cluster, which I understand is the centroid of the cluster. So far, I have tried to locate the point (structure) that is right in the centre of each cluster using the gmm.means_ attribute; however, that exact point does not correspond to any structure (I used numpy.where). I would need to obtain the coordinates of the structure closest to the centroid, but I have not found a function to do that in the documentation of the module (http://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html). How can I get a representative structure of each cluster?
Thanks a lot for your help, any suggestion will be appreciated.
((As this is a generic question, I haven't found it necessary to add the code used for the clustering or any data; please let me know if it is needed.))

For each cluster, you can measure its corresponding density for each training point, and choose the point with the maximal density to represent its cluster:
This code can serve as an example:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats
from sklearn import mixture

# Generate three overlapping 2-D blobs
n_samples = 100
C = np.array([[0.8, -0.1], [0.2, 0.4]])
X = np.r_[np.dot(np.random.randn(n_samples, 2), C),
          np.random.randn(n_samples, 2) + np.array([-2, 1]),
          np.random.randn(n_samples, 2) + np.array([1, -3])]

gmm = mixture.GaussianMixture(n_components=3, covariance_type='full').fit(X)
plt.scatter(X[:, 0], X[:, 1], s=1)

# For each component, pick the training point with the highest (log-)density
centers = np.empty(shape=(gmm.n_components, X.shape[1]))
for i in range(gmm.n_components):
    density = scipy.stats.multivariate_normal(cov=gmm.covariances_[i], mean=gmm.means_[i]).logpdf(X)
    centers[i, :] = X[np.argmax(density)]

plt.scatter(centers[:, 0], centers[:, 1], s=20)
plt.show()
It would draw the centers as orange dots:

Find the point with the smallest Mahalanobis distance to the cluster center.
GMM uses the Mahalanobis distance to assign points, so under the GMM model this is the point with the highest probability of belonging to that cluster.
You have everything you need to compute it: the cluster means_ and covariances_.
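A minimal sketch of that idea, assuming a fitted GaussianMixture called gmm with covariance_type='full' and the data in X (both names borrowed from the example above):
import numpy as np

# Sketch: for each component, compute the squared Mahalanobis distance of every
# point to the component mean and keep the closest point as the representative.
representatives = np.empty((gmm.n_components, X.shape[1]))
for i in range(gmm.n_components):
    diff = X - gmm.means_[i]                      # (n_samples, n_features)
    inv_cov = np.linalg.inv(gmm.covariances_[i])  # assumes covariance_type='full'
    d2 = np.einsum('ij,jk,ik->i', diff, inv_cov, diff)  # squared Mahalanobis distances
    representatives[i, :] = X[np.argmin(d2)]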

Related

How to connect the points from a contour of a binary image

I have a segmentation result stored in a binary image, from which I want to extract the contours. To do so, I compute the difference between the mask and the eroded mask, which gives me the pixels that lie on the boundaries of my segmentation result. Here is a code snippet:
import numpy as np
from skimage.morphology import binary_erosion
from matplotlib import pyplot as plt
# mask is a 2D boolean np.array containing the segmentation result
contour_raw=np.logical_xor(mask,binary_erosion(mask))
contour_y,contour_x=np.where(contour_raw)
fig=plt.figure()
plt.imshow(mask)
plt.plot(contour_x,contour_y,'.r')
I end up with a collection of dots on the contours of the mask:
The trouble starts when I want to connect the dots. A naive plot of the contours of course gives a disappointing result, because contour_x and contour_y are not sorted the way I would like:
plt.plot(contour_x,contour_y,'--r')
And here is the result, with a focus on an arbitrary part of the figure to highlight the connection between the dots:
How can I sort the contour coordinates contour_x and contour_y so that they are correctly ordered when I connect the dots? Furthermore, if my mask contains several independent connected components, I would like to obtain as many contours as there are connected components.
Thanks for your help!
Best,
I think combining clustering and a convex hull works in your case. For this example, I generate three synthetic segments with the make_blobs function and show each in a different color:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN
from scipy.spatial import ConvexHull, convex_hull_plot_2d
X, y = make_blobs(n_samples=1000, centers=3, n_features=2, random_state=0, cluster_std=0.3)
plt.scatter(X[:,0], X[:,1], c=y)
Then, since the segments are distributed in a two-dimensional map, we can run a density-based clustering method to cluster them; by finding a convex hull around each cluster, we obtain the points surrounding each cluster in order:
# Fitting Clustering
c_alg = DBSCAN()
c_alg.fit(X)
labels = c_alg.labels_
for i in range(0, max(labels) + 1):
    ind = np.where(labels == i)
    segment = X[ind, :][0]
    hull = ConvexHull(segment)
    plt.plot(segment[:, 0], segment[:, 1], 'o')
    for simplex in hull.simplices:
        plt.plot(segment[simplex, 0], segment[simplex, 1], 'k-')
However, in your case a concave hull should work rather than a convex hull. There is a package alphashape in Python that claims to find concave hulls in two-dimensional maps (see its documentation for more information). The tricky part is to find the best alpha, but in this example we can fit concave hulls using:
import alphashape
from descartes import PolygonPatch

fig, ax = plt.subplots()
for i in range(0, max(labels) + 1):
    ind = np.where(labels == i)
    points = X[ind, :][0, :, :]
    alpha_shape = alphashape.alphashape(points, 5.0)
    ax.scatter(*zip(*points))
    ax.add_patch(PolygonPatch(alpha_shape, alpha=0.5))
plt.show()

Sample from a distribution function in python

I have an array.
array([1,1,1,1,1,
1,1,1,0.96227599,0,
0,1,1,1,1,
0,0,1,0,0,
1,1,1,0,1,
1,1,0,1,0,
0,1,0,0,1,
0,0,1,1,1,
1,1,0,1,1,
1,1,1,1,1,
1,1,1,1,1,
1,1,0,0,0,
1,0,1,1,1,
1,1,1,1,1,
1,1,1,1,1,
0.94795539,0.85308765,0,0,1,
1,1,0.9113806,1,1,
1,1,1,1,1,
1,0,1,1,0,
1,1,1,1,1,
1,1,0.20363486,0.50635838,0.52025932,
0,0.34747655,0.50147493,0,0.4848249,
0,0.88495575,0,0.27620151,0.3981369,
0,0,0])
Values range from 0 to 1.
How can I plot a probability distribution function for these values? And how can I then fill a table with 1000 rows based on it, where each row has 5 columns, i.e. fill the table with samples of 5 values each?
To get a pdf from your samples you could use a kernel density estimator.
One option is the gaussian_kde from scipy.stats.
It includes automatic bandwidth determination. The estimation works best for a unimodal distribution; bimodal or multi-modal distributions tend to be oversmoothed.
Your samples look strongly bimodal, with clusters at 0 and 1, so you might be better advised to use sklearn's KernelDensity. There you have more control over the specific algorithm, the kernel, and the bandwidth.
Sklearn also has an introduction to Density Estimation
The workflow with both methods is quite similar:
import numpy as np
from scipy import stats
from sklearn.neighbors import KernelDensity
import matplotlib.pyplot as plt
a = np.array([1,1,1,1,1,1,1,1,0.96227599,0,0,1,1,1,1,0,0,1,0,0,1,1,1,0,1,1,1,0,1,0,0,1,0,0,1,0,0,1,1,1,1,1,0,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,0,0,0,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,0.94795539,0.85308765,0,0,1,1,1,0.9113806,1,1,1,1,
1,1,1,1,0,1,1,0,1,1,1,1,1,1,1,0.20363486,0.50635838,0.52025932,0,0.34747655,0.50147493,0,0.4848249,0,
0.88495575,0,0.27620151,0.3981369,0,0,0])
kde1 = stats.gaussian_kde(a)
x1 = np.linspace(0, 1, 100)
y1 = kde1.pdf(x1)
kde2 = KernelDensity(bandwidth=0.1).fit(a.reshape(-1, 1))
y2 = kde2.sample(10000)
kde3 = KernelDensity(bandwidth=0.01).fit(a.reshape(-1, 1))
y3 = kde3.sample(10000)
fig, ax = plt.subplots()
ax.plot(x1, y1, c='b')
ax.hist(y2.ravel(), bins=100, density=True, color='r', alpha=0.7)
ax.hist(y3.ravel(), bins=100, density=True, color='m', alpha=0.7)
Note that this method does not limit your pdf to values between [0, 1].
You have to take care of this yourself, e.g. by filtering those values out in a second step. However, if you choose a small bandwidth you should come pretty close.
I do not quite understand the second part of your question.
If you want to draw new samples from the estimated distribution you can do so via kde.sample() (sklearn) / kde.resample() (scipy). Filling those values into a table is a different question, for which you will definitely find answers here on Stack Overflow.
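As a minimal sketch of that second part, assuming you want 1000 rows of 5 samples each and reusing the kde2 (sklearn) and kde1 (scipy) estimators fitted above (the oversampling factor is a guess; increase it if too few in-range samples remain):
# Sketch: oversample, filter out values outside [0, 1] as suggested above,
# and arrange the first 5000 remaining samples as a 1000 x 5 table.
s = kde2.sample(20000).ravel()        # sklearn KernelDensity
s = s[(s >= 0) & (s <= 1)][:5000]     # keep only values in [0, 1]
table = s.reshape(1000, 5)

# The scipy estimator works the same way via resample():
# s = kde1.resample(20000).ravel()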

How to inverse scaled numpy array for visualization purposes?

I am doing clustering and therefore scaled the data beforehand. I now want my visualization (cluster chart) to use the original data points, i.e. the values before they were scaled. I have not come across a good solution yet. I hope someone can help.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
from sklearn import metrics

# convert df 'data' to a numpy array for clustering
data = data.values
X = data

# Scale
X = StandardScaler().fit_transform(X)

# Compute DBSCAN
db = DBSCAN(eps=0.25, min_samples=10).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)

# Internal indices measure for performance
print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(X, labels))

# Plot result
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]
    class_member_mask = (labels == k)
    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)
    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)
plt.title('Estimated number of clusters, excluding noise cluster: %d' % n_clusters_)
plt.xlabel('A', fontsize=18)
plt.ylabel('B', fontsize=16)
plt.ylim(ymax=5, ymin=-0.5)
plt.xlim(xmax=5, xmin=-0.5)
plt.show()
Output: It shows the cluster graph but with scaled values on the axis.
Questions:
1. How can I plot it with its original values?
2. Am I missing anything in general for DBSCAN clustering, i.e. how do I ensure that my clustering performance is good? I do not have a ground truth, so I only used the Silhouette metric, but I am not confident that my model's performance is really good. What is the purpose of a ground truth if I am NOT trying to predict in my case, but rather only describe the current state?
Just plot the original data then.
I.e., plot data, not X, if that is what you want.
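A minimal sketch of that, reusing the labels and masks computed on the scaled X but indexing into the unscaled array data from your snippet (if data is no longer around, keeping a reference to the fitted StandardScaler and calling its inverse_transform on X recovers the same values):
# Sketch: same plotting loop as above, but drawing the original (unscaled) values
for k, col in zip(unique_labels, colors):
    if k == -1:
        col = [0, 0, 0, 1]  # black for noise
    class_member_mask = (labels == k)
    xy = data[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)
    xy = data[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)
plt.show()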
Cluster performance is inherently subjective. A clustering is good if you learn something about your data that you did not know before. Since what you "know" or what is "useful" cannot be captured in equations, it cannot be reliably evaluated; any evaluation is just a heuristic. Silhouette is not a good choice here, because it punishes noise and non-convex clusters. Internal measures are essentially clustering criteria themselves, and external measures compute how well the algorithm finds something you already know - neither is good for real data. External measures are popular in scientific papers, though, to demonstrate that an algorithm isn't complete garbage: you pretend you do not know what you do know, and then check whether the algorithm can still find that pattern.
So what do you need to do? Investigate: does it look useful, is it worth trying to use this? Then proceed: try to use the clustering to solve your problem. The clustering is good if it helps solve your problem.

Using Kmeans to cluster small phrases in Spark

I have a list of words/phrases (around a million) that I would like to cluster. Assume it is the following list:
a_list = [u'java',u'javascript',u'python dev',u'pyspark',u'c ++']
a_list_rdd = sc.parallelize(a_list)
and I follow this procedure:
Using a string distance (let's say the Jaro-Winkler metric) I compute all the distances between the words in the list, which creates a 5x5 matrix with ones on the diagonal, since each word's distance to itself is also computed. To compute all the distances I broadcast the whole list. So:
a_list_rdd_broadcasted = sc.broadcast(a_list_rdd.collect())
and the string distances computations:
import jaro
from numpy import array

def ComputeStringDistance(phrase, phrase_list_broadcasted):
    keyvalueDistances = []
    for value in phrase_list_broadcasted:
        distanceValue = jaro.jaro_winkler_metric(phrase, value)
        keyvalueDistances.append(distanceValue)
    return array(keyvalueDistances)

string_distances = (a_list_rdd
                    .map(lambda phrase: ComputeStringDistance(phrase, a_list_rdd_broadcasted.value))
                    )
and using K means for clustering:
from pyspark.mllib.clustering import KMeans, KMeansModel

clusters = KMeans.train(string_distances, 3, maxIterations=10,
                        runs=10, initializationMode="random")
PredictGroup = string_distances.map(lambda point: clusters.predict(point)).zip(a_list_rdd)
and the results:
PredictGroup.collect()
Out[73]:
[(0, u'java'),
(0, u'javascript'),
(2, u'python'),
(2, u'pyspark'),
(1, u'c ++')]
Not bad! But what happens if I have 1 million observations and an estimated 10000 clusters? Reading some posts, a large number of clusters is really expensive. Is there a way to get around this issue?
k-means does not operate on a distance matrix (distance matrices also do not scale).
K-means also does not work with arbitrary distance functions.
It's about minimizing variance, the sum-of-squared-deviations-from-the-mean.
What you are doing works because it's halfway to spectral clustering, but it's neither k-means used correctly, nor spectral clustering.
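To make the objective concrete, here is a small sketch with scikit-learn's KMeans (not Spark); it checks that the model's inertia_ is exactly the within-cluster sum of squared deviations from the means:
import numpy as np
from sklearn.cluster import KMeans

# Sketch: k-means minimizes the within-cluster sum of squared Euclidean
# distances to the cluster means, not an arbitrary (string) distance.
X = np.random.rand(200, 2)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Recompute the objective by hand and compare with inertia_
sse = sum(np.sum((X[km.labels_ == k] - km.cluster_centers_[k]) ** 2)
          for k in range(km.n_clusters))
print(km.inertia_, sse)  # the two values agree (up to floating point)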

Measure the uniformity of distribution of points in a 2D square

I am currently running into this problem: I have a 2D square with a set of points inside it, say, 1000 points. I need a way to see whether the distribution of points inside the square is spread out (more or less uniformly distributed) or whether the points tend to gather together in some spots inside the square.
I need a mathematical/statistical (not programming) way to determine this. I googled and found things like goodness of fit and Kolmogorov..., and I just wonder if there are other approaches to achieve this. I need this for a class paper.
So: Inputs: a 2D square, and 1000 points.
Output: yes/no (yes = evenly spread out, no = gathering together in some spots).
Any idea would be appreciated.
Thanks
If your points are independent you can just check the distribution for each dimension individually. The Kolmogorov-Smirnov test (a measure of the distance between 2 distributions) is a good test for this. First let's generate and plot some Gaussian-distributed points so you can see how you can use the KS test (statistic) to detect a nonuniform distribution.
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> X = np.random.randn(1000, 2)  # 1000 2-D points, normally distributed
>>> from sklearn.preprocessing import MinMaxScaler
>>> scaler = MinMaxScaler()
>>> X = scaler.fit_transform(X)  # fit to default uniform dist range 0-1
>>> X
array([[ 0.46169481,  0.7444449 ],
       [ 0.49408692,  0.5809512 ],
       ...,
       [ 0.60877526,  0.59758908]])
>>> plt.scatter(X[:, 0], X[:, 1])
>>> from scipy import stats
>>> from sklearn.preprocessing import StandardScaler, MinMaxScaler
>>> stats.kstest(MinMaxScaler().fit_transform(X[:, [0]]).ravel(), 'uniform')
KstestResult(statistic=0.24738043186386116, pvalue=0.0)
The low p-value and high KS statistic (distance from the uniform distribution) say the samples almost certainly did not come from a uniform distribution between 0 and 1.
>>> stats.kstest(StandardScaler().fit_transform(X[:, [0]]).ravel(), 'norm')
KstestResult(statistic=0.028970945967462303, pvalue=0.36613946547024456)
But they probably did come from a normal distribution with mean 0 and standard deviation 1 because of the high p-value and low KS distance.
Then you would just repeat the KS tests for the second dimension (Y).
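A minimal sketch of that repetition, reusing the same X and assuming the second column holds the Y coordinates:
>>> stats.kstest(MinMaxScaler().fit_transform(X[:, [1]]).ravel(), 'uniform')
>>> stats.kstest(StandardScaler().fit_transform(X[:, [1]]).ravel(), 'norm')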
