How to inverse scaled numpy array for visualization purposes? - python-3.x

I am doing clustering and conducted scaling therefore. I now want my visualization (cluster chart) to use the original data points, i.e. before they were scaled. I did not come across a good solution yet. I hope someone can help.
#convert df='data' to numpy array for clustering
data=data.values
X=data
#Scale
X = StandardScaler().fit_transform(X)
# Compute DBSCAN
db = DBSCAN(eps=0.25, min_samples=10).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)
#Internal indeces measure for performance
print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(X, labels))
# Plot result
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
if k == -1:
# Black used for noise.
col = [0, 0, 0, 1]
class_member_mask = (labels == k)
xy = X[class_member_mask & core_samples_mask]
plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
markeredgecolor='k', markersize=14)
xy = X[class_member_mask & ~core_samples_mask]
plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
markeredgecolor='k', markersize=6)
plt.title('Estimated number of clusters, excluding noise cluster: %d' % n_clusters_)
plt.xlabel('A', fontsize=18)
plt.ylabel('B', fontsize=16)
plt.ylim(ymax = 5, ymin = -0.5)
plt.xlim(xmax = 5, xmin = -0.5)
plt.show();
Output: It shows the cluster graph but with scaled values on the axis.
Questions:
1. How can I plot it with its original values?
2. Am I missing anything in general for doing DBSCAN clustering? i.e. How do I ensure that my cluster performance is good? I do not have a ground truth, so I only used the Shilouette metric but I feel not confident that my model's performance is really good. What is the purpose of ground truth if I am NOT trying to predict in my case and rather describe the current state only?

Just plot the original data then.
I.e., plot data, not X, if that is what you want.
Cluster performance is inherently subjective. It is good if you learn something about your data that you did not know before. As it cannot be captured in equations what you "know" or what is "useful", 8t cannot be reliably evaluated. Any evaluation is just a heuristic. Silhouette is not a good choice, because it punishes noise and non-convex clusters. Internal measures are just like clustering algorithms. External measures compute how well they find something you already know - neither is good for actual data. External measures are popular for scientific papers though to demonstrate that the algorithm isn't complete garbage. You pretend you do not know what you do know, and then check if the algorithm can still find that pattern.
So what do you need to do? Investigate: does it look useful, is it worth trying to use this. Then proceed: Try to use the clustering to solve your problem. It is good if it helps solving your problem.

Related

Failing a simple Cosine fit in Python

Here's how I generate my data and the tried fit:
import matplotlib.pyplot as plt
from scipy import optimize
import numpy as np
def f(t,a,b):
return a*np.cos(b*t)
v = 0
x = 0.03
t = 0
dt = 0.001
time = []
pos = []
while t<3:
a = (-5*x)/0.1
v = v + a*dt
x = x + v*dt
time.append(t)
pos.append(x)
t = t+dt
pop, pcov = optimize.curve_fit(f,time,pos)
print(pop)
Even when I indicate initial values for the parameters (such as 0.03 for "a" and "7" for b), the resulting fit is still way off (see below, dashed line is the fit function).
Am I using the wrong library? or have I made an obvious blunder?
Thanks for any hints.
As Tyberius noted, you need to provide better initial values.
Why is that? optimize.curve_fit uses least_squares which finds a local minimum of the cost function.
I believe in your case you are stuck in such a local minimum (that is not the global minimum). If you look at your diagram, your fit is approximately y=0. (It is a bit wavy because it is a cosine)
If you were to increase a a bit the error would go up, so a stays close to zero. And if you were to increase b to fit the frequency of the data, the cost function would go up as well so that one stays low as well.
If you don't provide initial values, the parameters start at 1 each so it looks like this:
plt.plot(time, pos, 'black', label="data")
a,b = 1,1
init = [a*np.cos(b*t) for t in time]
plt.plot(time, init, 'b', label="a,b=1,1")
plt.legend()
plt.show()
a will go down and b will stay behind. I believe the scale is an additional problem. If you normalized your data to have an amplitude of 1 the humps might be more pronounced and easier to fit.
If you start with a convenient value for a, b can find its way from an initial value as low as 5:
plt.plot(time, pos, 'black', label="data")
for i in [1, 4.8, 4.9, 5]:
pop, pcov = optimize.curve_fit(f,time,pos, p0=(0.035,i))
a,b = pop
fit = [a*np.cos(b*t) for t in time]
plt.plot(time, fit, label=f"$b_0 = {i}$")
plt.legend()
plt.show()

How to density based clustering for speed trajectories in a video?

I have a speed of feature points at every frame. Here I have 165 frames in a video where every frame contains speed of feature points.This is my data.
TrajDbscanData
array([[ 1. , 0.51935178],
[ 1. , 0.52063496],
[ 1. , 0.54598193],
...,
[165. , 0.47198981],
[165. , 2.2686042 ],
[165. , 0.79044946]])
where first index is frame number and second one is speed of a feature point at that frame.
Here I want to do density based clustering for different speed range. For this , I use following code.
import sklearn.cluster as sklc
core_samples, labels_db = sklc.dbscan(
TrajDbscanData, # array has to be (n_samples, n_features)
eps=0.5,
min_samples=15,
metric='euclidean',
algorithm='auto'
)
core_samples_mask = np.zeros_like(labels_db, dtype=bool)
core_samples_mask[core_samples] = True
unique_labels = set(labels_db)
n_clusters_ = len(unique_labels) - (1 if -1 in labels_db else 0)
colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
plt.figure(figcount)
figcount+=1
for k, col in zip(unique_labels, colors):
if k == -1:
# Black used for noise.
col = 'k'
class_member_mask = (labels_db == k)
xy = TrajDbscanData[class_member_mask & core_samples_mask]
plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col, markeredgecolor='k', markersize=6)
xy = TrajDbscanData[class_member_mask & ~core_samples_mask]
plt.plot(xy[:, 0], xy[:, 1], 'x', markerfacecolor=col, markeredgecolor='k', markersize=4)
plt.rcParams["figure.figsize"] = (10,7)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.grid(True)
plt.show()
I got the following result.
Y axis is speed and x axis is frame number
I want to do density based clustering according to speed. for example speed upto 1.0 in one cluster , speed from 1 to 1.5 as outlier , speed from 1.5 to 2.0 another cluster and speed above 2.0 in another cluster. This helps to identify common motion pattern types. How can I do this ?
Don't use Euclidean distance.
Since your x and y a is have very different meaning, that is the wrong distance function to use.
Your plot is misleading, because the axes have different scale. If you would scale x and y the same way, you would understand what has been happening... The y axis is effectively ignored, and you slice the data by your discrete integer time axis.
You may need to use Generalized DBSCAN and treat time and value separately!

sklearn - label points of PCA

I am generating a PCA which uses scikitlearn, numpy and matplotlib. I want to know how to label each point (row in my data). I found "annotate" in matplotlib, but this seems to be for labeling specific coordinates, or just putting text on arbitrary points by the order they appear. I'm trying to abstract away from this but struggling due to the PCA sections that appear before the matplot stuff. Is there a way I can do this with sklearn, while I'm still generating the plot, so I don't lose its connection to the row I got it from?
Here's my code:
# Create a Randomized PCA model that takes two components
randomized_pca = decomposition.RandomizedPCA(n_components=2)
# Fit and transform the data to the model
reduced_data_rpca = randomized_pca.fit_transform(x)
# Create a regular PCA model
pca = decomposition.PCA(n_components=2)
# Fit and transform the data to the model
reduced_data_pca = pca.fit_transform(x)
# Inspect the shape
reduced_data_pca.shape
# Print out the data
print(reduced_data_rpca)
print(reduced_data_pca)
def rand_jitter(arr):
stdev = .01*(max(arr)-min(arr))
return arr + np.random.randn(len(arr)) * stdev
colors = ['red', 'blue']
for i in range(len(colors)):
w = reduced_data_pca[:, 0][y == i]
z = reduced_data_pca[:, 1][y == i]
plt.scatter(w, z, c=colors[i])
targ_names = ["Negative", "Positive"]
plt.legend(targ_names, bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title("PCA Scatter Plot")
plt.show()
PCA is a projection, not a clustering (you tagged this as clustering).
There is no concept of a label in PCA.
You can draw texts onto a scatterplot, but usually it becomes too crowded. You can find answers to this on stackoverflow already.

How can I get a representative point of a GMM cluster?

I have clustered my data (75000, 3) using sklearn Gaussian mixture model algorithm (GMM). I have 4 clusters. Each point of my data represents a molecular structure. Now I would like to get the most representative molecular structure of each cluster which I understand is the centroid of the cluster. So far, I have tried to locate the point (structure) that is right in the centre of the cluster using gmm.means_ attribute, however that exact point does not correspond to any structure (I used numpy.where). I would need to obtain the coordinates of the closest structure to the centroid, but I have not found the function to do that in the documentation of the module (http://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html). How can I get a representative structure of each cluster?
Thanks a lot for your help, any suggestion will be appreciated.
((As this is a generic question I haven't found necessary to add the code used for the clustering or any data, please let me know if it is necessary))
For each cluster, you can measure its corresponding density for each training point, and choose the point with the maximal density to represent its cluster:
This code can serve as an example:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats
from sklearn import mixture
n_samples = 100
C = np.array([[0.8, -0.1], [0.2, 0.4]])
X = np.r_[np.dot(np.random.randn(n_samples, 2), C),
np.random.randn(n_samples, 2) + np.array([-2, 1]),
np.random.randn(n_samples, 2) + np.array([1, -3])]
gmm = mixture.GaussianMixture(n_components=3, covariance_type='full').fit(X)
plt.scatter(X[:,0], X[:, 1], s = 1)
centers = np.empty(shape=(gmm.n_components, X.shape[1]))
for i in range(gmm.n_components):
density = scipy.stats.multivariate_normal(cov=gmm.covariances_[i], mean=gmm.means_[i]).logpdf(X)
centers[i, :] = X[np.argmax(density)]
plt.scatter(centers[:, 0], centers[:, 1], s=20)
plt.show()
It would draw the centers as orange dots:
Find the point with the smallest Mahalanobis distance to the cluster center.
Because GMM uses Mahalanobis distance to assign points. By the GMM model, this is the point with the highest probability of belonging to this cluster.
You have all you need to compute this: cluster means_ and covariances_.

Using Kmeans to cluster small phrases in Spark

I am having a list of words/phrases(around a million) that I would like to cluster. I am assuming that its the following list:
a_list = [u'java',u'javascript',u'python dev',u'pyspark',u'c ++']
a_list_rdd = sc.parallelize(a_list)
and I follow this procedure:
Using a string distance(lets say jaro winkler metric) i compute all the distance between the list of the words which will create a matrix of 5x5 with the diagonal being ones, as it computes the distances between itself. And to compute all the distances I broadcast the whole list. So:
a_list_rdd_broadcasted = sc.broadcast(a_list_rdd.collect())
and the string distances computations:
import jaro
def ComputeStringDistance(phrase,phrase_list_broadcasted):
keyvalueDistances = []
for value in phrase_list_broadcasted:
distanceValue = jaro.jaro_winkler_metric(phrase,value)
keyvalueDistances.append(distanceValue)
return (array(keyvalueDistances))
string_distances = (a_list_rdd
.map(lambda phrase:ComputeStringDistance(phrase,a_list_rdd_broadcasted.value))
)
and using K means for clustering:
from pyspark.mllib.clustering import KMeans, KMeansModel
clusters = KMeans.train(string_distances, 3 , maxIterations=10,
runs=10, initializationMode="random")
PredictGroup = string_distances.map(lambda point:clusters.predict(point)).zip(a_list_rdd)
and the results:
PredictGroup.collect()
ut[73]:
[(0, u'java'),
(0, u'javascript'),
(2, u'python'),
(2, u'pyspark'),
(1, u'c ++')]
not bad! But what happens if I have 1 million observations and an estimation of around 10000 clusters? Reading some posts large number of clusters is really expensive. Is there a way to overpass this issue?
k-means foes not operate on a distance matrix (distance matrixes also do not scale).
K-means also does not work with arbitrary distance functions.
It's about minimizing variance, the sum-of-squared-deviations-from-the-mean.
What you are doing works because it's halfway to spectral clustering, but it's neither k-means used correctly, nor spectral clustering.

Resources