Using Kmeans to cluster small phrases in Spark - apache-spark

I have a list of words/phrases (around a million) that I would like to cluster. Assume it is the following list:
a_list = [u'java',u'javascript',u'python dev',u'pyspark',u'c ++']
a_list_rdd = sc.parallelize(a_list)
and I follow this procedure:
Using a string distance (let's say the Jaro-Winkler metric), I compute all the distances between the words in the list, which produces a 5x5 matrix with ones on the diagonal, since each word is compared with itself. To compute all the distances I broadcast the whole list. So:
a_list_rdd_broadcasted = sc.broadcast(a_list_rdd.collect())
and the string distances computations:
import jaro
from numpy import array

def ComputeStringDistance(phrase, phrase_list_broadcasted):
    # distance of this phrase to every phrase in the broadcast list
    keyvalueDistances = []
    for value in phrase_list_broadcasted:
        distanceValue = jaro.jaro_winkler_metric(phrase, value)
        keyvalueDistances.append(distanceValue)
    return array(keyvalueDistances)

string_distances = (a_list_rdd
                    .map(lambda phrase: ComputeStringDistance(phrase, a_list_rdd_broadcasted.value))
                    )
and using K means for clustering:
from pyspark.mllib.clustering import KMeans, KMeansModel
clusters = KMeans.train(string_distances, 3, maxIterations=10,
                        runs=10, initializationMode="random")
PredictGroup = string_distances.map(lambda point: clusters.predict(point)).zip(a_list_rdd)
and the results:
PredictGroup.collect()
Out[73]:
[(0, u'java'),
 (0, u'javascript'),
 (2, u'python dev'),
 (2, u'pyspark'),
 (1, u'c ++')]
Not bad! But what happens if I have 1 million observations and an estimated 10,000 clusters? Reading some posts, it seems that a large number of clusters is really expensive. Is there a way to get around this issue?

K-means does not operate on a distance matrix (distance matrices also do not scale).
K-means also does not work with arbitrary distance functions.
It is about minimizing variance, the sum of squared deviations from the mean.
What you are doing works because it is halfway to spectral clustering, but it is neither k-means used correctly nor spectral clustering.
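If the goal is simply to cluster phrases at scale, one workaround (a rough sketch, not what this answer prescribes) is to skip the pairwise distance matrix entirely: embed each phrase as a fixed-length vector of hashed character n-grams and run k-means on those vectors. Assuming a SparkSession named spark is available:
from pyspark.ml import Pipeline
from pyspark.ml.feature import RegexTokenizer, NGram, HashingTF
from pyspark.ml.clustering import KMeans

df = spark.createDataFrame([(p,) for p in a_list], ["phrase"])

pipeline = Pipeline(stages=[
    # split each phrase into single characters (gaps=False matches tokens rather than splitting on gaps)
    RegexTokenizer(inputCol="phrase", outputCol="chars", pattern=".", gaps=False),
    # character trigrams roughly capture the string-similarity intuition
    NGram(n=3, inputCol="chars", outputCol="ngrams"),
    # hash the n-grams into a fixed-length sparse feature vector
    HashingTF(inputCol="ngrams", outputCol="features", numFeatures=1 << 18),
    KMeans(k=3, seed=1),
])
clustered = pipeline.fit(df).transform(df).select("phrase", "prediction")
clustered.show()
This avoids both the broadcast of the full list and the N x N distance computation, at the cost of clustering in an approximate feature space rather than on Jaro-Winkler distances.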

Related

Removing the redundant features from a classification dataset (make_classification)

In the make_classification method,
X, y = make_classification(n_samples=10, n_features=8, n_informative=7, n_redundant=1, n_repeated=0, n_classes=2, random_state=6)
Docstring about n_redundant: The number of redundant features. These features are generated as random linear combinations of the informative features.
Docstring about n_repeated: The number of duplicated features, drawn randomly from the informative and the redundant features.
The n_repeated features are easy to spot, as they are highly correlated with the informative features.
The docstrings for the repeated and redundant features indicate that both are derived from the informative features.
My question is: how can redundant features be removed/highlighted, and what are their characteristics?
Attached is the correlation heatmap among all the features. Which feature in the image is redundant?
Please help.
To check how many linearly independent columns there are, use np.linalg.matrix_rank(X).
To find the indices of the linearly independent columns of matrix X, use sympy.Matrix(X).rref().
DEMO
Generate dataset and check number of independent columns (matrix rank):
import numpy as np
from sklearn.datasets import make_classification
from sympy import Matrix

X, _ = make_classification(
    n_samples=10, n_features=8, n_redundant=2, random_state=6
)
np.linalg.matrix_rank(X, tol=1e-3)
# 6
Find indices of linearly independent columns:
_, inds = Matrix(X).rref(iszerofunc=lambda x: abs(x) < 1e-3)
inds
# (0, 1, 2, 3, 6, 7)
Remove dependent columns and check matrix rank (num of independent columns):
# keep only the linearly independent columns
X_independent = X[:, inds]
assert np.linalg.matrix_rank(X_independent, tol=1e-3) == X_independent.shape[1]
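As a small follow-up sketch, the redundant (linearly dependent) columns are simply the complement of the pivot indices found above:
# columns not among the pivot indices are linear combinations of the others
redundant_cols = [i for i in range(X.shape[1]) if i not in inds]
redundant_cols
# [4, 5]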

how tfidf value is used in k-means clustering

I am using K-means clustering with TF-IDF using the scikit-learn library. I understand that K-means uses distances to create clusters, and that a distance is computed between points such as (x axis value, y axis value), but TF-IDF looks like a single numerical value. My question is: how is this TF-IDF value converted into an (x, y) value by K-means clustering?
TF-IDF isn't a single value (i.e. scalar). For every document, it returns a vector where each value in the vector corresponds to each word in the vocabulary.
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

sent1 = "the quick brown fox jumps over the lazy brown dog"
sent2 = "mr brown jumps over the lazy fox"
corpus = [sent1, sent2]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(X.todense())
[out]:
matrix([[0.50077266, 0.35190925, 0.25038633, 0.25038633, 0.25038633,
         0.        , 0.25038633, 0.35190925, 0.50077266],
        [0.35409974, 0.        , 0.35409974, 0.35409974, 0.35409974,
         0.49767483, 0.35409974, 0.        , 0.35409974]])
It returns a 2-D matrix where the rows represent the sentences and the columns represent the vocabulary.
>>> vectorizer.vocabulary_
{'the': 8,
'quick': 7,
'brown': 0,
'fox': 2,
'jumps': 3,
'over': 6,
'lazy': 4,
'dog': 1,
'mr': 5}
So when K-means tries to find the distance/similarity between two documents, it is comparing two rows of the matrix. E.g., assuming the similarity is just the dot product between two rows:
import numpy as np
vector1 = X.todense()[0]
vector2 = X.todense()[1]
float(np.dot(vector1, vector2.T))
[out]:
0.7092938737640962
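To tie this back to the question, here is a minimal sketch of clustering these TF-IDF rows directly (the cluster count and random_state here are arbitrary choices for illustration):
from sklearn.cluster import KMeans

# k-means works on the full TF-IDF matrix: one row (vector) per document
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                 # cluster assignment for each document
print(km.cluster_centers_.shape)  # (n_clusters, vocabulary size)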
Chris Potts has a nice tutorial on how vector space models like TF-IDF are created: http://web.stanford.edu/class/linguist236/materials/ling236-handout-05-09-vsm.pdf

How to inverse scaled numpy array for visualization purposes?

I am doing clustering and scaled the data beforehand. I now want my visualization (cluster chart) to use the original data points, i.e. the values before they were scaled. I have not come across a good solution yet. I hope someone can help.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
from sklearn import metrics

# convert the DataFrame 'data' to a numpy array for clustering
data = data.values
X = data

# scale
X = StandardScaler().fit_transform(X)

# compute DBSCAN
db = DBSCAN(eps=0.25, min_samples=10).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

# number of clusters in labels, ignoring noise if present
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)

# internal index measure for performance
print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(X, labels))

# plot result
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # black used for noise
        col = [0, 0, 0, 1]
    class_member_mask = (labels == k)
    # core samples
    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)
    # non-core samples
    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)
plt.title('Estimated number of clusters, excluding noise cluster: %d' % n_clusters_)
plt.xlabel('A', fontsize=18)
plt.ylabel('B', fontsize=16)
plt.ylim(-0.5, 5)
plt.xlim(-0.5, 5)
plt.show()
Output: it shows the cluster graph, but with scaled values on the axes.
Questions:
1. How can I plot it with the original values?
2. Am I missing anything in general for doing DBSCAN clustering? i.e. how do I ensure that my cluster performance is good? I do not have a ground truth, so I only used the silhouette metric, but I am not confident that my model's performance is really good. What is the purpose of a ground truth if I am NOT trying to predict in my case and rather want to describe the current state only?
Just plot the original data then.
I.e., plot data, not X, if that is what you want.
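A minimal sketch of that advice, assuming the data array and imports from the question above: the clustering still runs on the scaled X, but the plot uses the unscaled values (scaler.inverse_transform(X) recovers the same values as data, up to floating-point error).
scaler = StandardScaler().fit(data)
X_scaled = scaler.transform(data)
labels = DBSCAN(eps=0.25, min_samples=10).fit(X_scaled).labels_

# plot in the original coordinate system, colored by cluster label
original = scaler.inverse_transform(X_scaled)  # equivalent to plotting `data`
plt.scatter(original[:, 0], original[:, 1], c=labels, cmap='Spectral', s=14)
plt.title('Clusters plotted in original (unscaled) coordinates')
plt.show()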
Cluster performance is inherently subjective. It is good if you learn something about your data that you did not know before. Since what you "know" or what is "useful" cannot be captured in equations, it cannot be reliably evaluated; any evaluation is just a heuristic. Silhouette is not a good choice, because it punishes noise and non-convex clusters. Internal measures are just like clustering algorithms; external measures compute how well you find something you already know. Neither is good for actual data. External measures are popular in scientific papers, though, to demonstrate that an algorithm isn't complete garbage: you pretend you do not know what you do know, and then check whether the algorithm can still find that pattern.
So what do you need to do? Investigate: does it look useful, is it worth trying to use? Then proceed: try to use the clustering to solve your problem. It is good if it helps solve your problem.

PySpark: Get Threshold (cutoff) values for each point in ROC curve

I'm starting with PySpark, building binary classification models (logistic regression), and I need to find the optimal threshold (cutoff) point for my models.
I want to use the ROC curve to find this point, but I don't know how to extract the threshold value for each point on the curve. Is there a way to find these values?
Things I've found:
This post shows how to extract the ROC curve, but only the values for the TPR and FPR. That's useful for plotting and for selecting the optimal point, but I can't find the threshold values.
I know I can find the threshold values for each point on the ROC curve using H2O (I've done it before), but I'm working in PySpark.
Here is a post describing how to do it with R... but, again, I need to do it with PySpark.
Other facts
I'm using Apache Spark 2.4.0.
I'm working with Data Frames (I really don't know - yet - how to work with RDDs, but I'm not afraid to learn ;) )
If you specifically need to generate ROC curves for different thresholds, one approach could be to generate a list of threshold values you're interested in and fit/transform on your dataset for each threshold. Or you could manually calculate the ROC curve for each threshold point using the probability field in the response from model.transform(test).
Alternatively, you can use BinaryClassificationMetrics to extract a curve plotting various metrics (F1 score, precision, recall) by threshold.
Unfortunately it appears the PySpark version doesn't implement most of the methods the Scala version does, so you'd need to wrap the class to do it in Python.
For example:
from pyspark.mllib.evaluation import BinaryClassificationMetrics

# Scala version implements .roc() and .pr()
# Python: https://spark.apache.org/docs/latest/api/python/_modules/pyspark/mllib/common.html
# Scala: https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.html
class CurveMetrics(BinaryClassificationMetrics):
    def __init__(self, *args):
        super(CurveMetrics, self).__init__(*args)

    def _to_list(self, rdd):
        points = []
        # Note this collect could be inefficient for large datasets
        # considering there may be one probability per datapoint (at most)
        # The Scala version takes a numBins parameter,
        # but it doesn't seem possible to pass this from Python to Java
        for row in rdd.collect():
            # Results are returned as type scala.Tuple2,
            # which doesn't appear to have a py4j mapping
            points += [(float(row._1()), float(row._2()))]
        return points

    def get_curve(self, method):
        rdd = getattr(self._java_model, method)().toJavaRDD()
        return self._to_list(rdd)
Usage:
import matplotlib.pyplot as plt

preds = predictions.select('label', 'probability').rdd.map(lambda row: (float(row['probability'][1]), float(row['label'])))

# returned as a list of (false positive rate, true positive rate) points
points = CurveMetrics(preds).get_curve('roc')

plt.figure()
x_val = [x[0] for x in points]
y_val = [x[1] for x in points]
plt.title('ROC curve')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.plot(x_val, y_val)
This produces a plot of the ROC curve.
The same wrapper can also produce other curves, for example an F1 score curve by threshold value, if you aren't married to ROC.
One way is to use sklearn.metrics.roc_curve.
First use your fitted model to make predictions:
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(labelCol="label", featuresCol="features")
model = lr.fit(trainingData)
predictions = model.transform(testData)
Then collect your scores and labels [1]:
preds = predictions.select('label','probability')\
.rdd.map(lambda row: (float(row['probability'][1]), float(row['label'])))\
.collect()
Now transform preds to work with roc_curve:
from sklearn.metrics import roc_curve
y_score, y_true = zip(*preds)
fpr, tpr, thresholds = roc_curve(y_true, y_score, pos_label = 1)
Notes:
[1] I am not 100% certain that the probabilities vector will always be ordered such that the positive label is at index 1. However, in a binary classification problem you'll know right away if your AUC is less than 0.5; in that case, just take 1 - p for the probabilities (since the class probabilities sum to 1).
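Once fpr, tpr, and thresholds are available, one common (though not the only) way to pick an "optimal" cutoff is to maximize Youden's J statistic, sketched below:
import numpy as np

# Youden's J = TPR - FPR; the threshold that maximizes it is one
# common definition of the optimal operating point on the ROC curve
j_scores = np.asarray(tpr) - np.asarray(fpr)
optimal_threshold = thresholds[int(np.argmax(j_scores))]
print(optimal_threshold)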

How can I get a representative point of a GMM cluster?

I have clustered my data (75000, 3) using the sklearn Gaussian mixture model algorithm (GMM). I have 4 clusters. Each point of my data represents a molecular structure. Now I would like to get the most representative molecular structure of each cluster, which I understand is the centroid of the cluster. So far, I have tried to locate the point (structure) that is right at the centre of the cluster using the gmm.means_ attribute; however, that exact point does not correspond to any structure (I used numpy.where to check). I would need to obtain the coordinates of the structure closest to the centroid, but I have not found a function for that in the documentation of the module (http://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html). How can I get a representative structure of each cluster?
Thanks a lot for your help, any suggestion will be appreciated.
((As this is a generic question, I haven't found it necessary to add the code used for the clustering or any data; please let me know if it is needed.))
For each cluster, you can evaluate its density at every training point and choose the point with the maximal density to represent that cluster.
This code can serve as an example:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats
from sklearn import mixture

n_samples = 100
C = np.array([[0.8, -0.1], [0.2, 0.4]])
X = np.r_[np.dot(np.random.randn(n_samples, 2), C),
          np.random.randn(n_samples, 2) + np.array([-2, 1]),
          np.random.randn(n_samples, 2) + np.array([1, -3])]

gmm = mixture.GaussianMixture(n_components=3, covariance_type='full').fit(X)
plt.scatter(X[:, 0], X[:, 1], s=1)

centers = np.empty(shape=(gmm.n_components, X.shape[1]))
for i in range(gmm.n_components):
    density = scipy.stats.multivariate_normal(cov=gmm.covariances_[i], mean=gmm.means_[i]).logpdf(X)
    centers[i, :] = X[np.argmax(density)]

plt.scatter(centers[:, 0], centers[:, 1], s=20)
plt.show()
It would draw the centers as orange dots:
Find the point with the smallest Mahalanobis distance to the cluster center, because the GMM effectively uses Mahalanobis distance to assign points. Under the GMM model, this is the point with the highest probability of belonging to that cluster.
You have all you need to compute this: the cluster means_ and covariances_.
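A minimal sketch of this suggestion, assuming the fitted gmm and data X from the example above:
import numpy as np

representatives = np.empty((gmm.n_components, X.shape[1]))
for i in range(gmm.n_components):
    inv_cov = np.linalg.inv(gmm.covariances_[i])
    diff = X - gmm.means_[i]
    # squared Mahalanobis distance of every point to component i's mean
    sq_mahalanobis = np.einsum('ij,jk,ik->i', diff, inv_cov, diff)
    representatives[i] = X[np.argmin(sq_mahalanobis)]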
