Clustering Components - python-dedupe

When clustering, I receive the following warning:
UserWarning: A component contained 77760 elements.
Components larger than 30000 are re-filtered.
The threshold for this filtering is 4.08109134074e-15
What does this mean?
My original threshold specification was 0.191, as below:
clustered_dupes = deduper.match(data,threshold=0.191)

The threshold is for the cophenetic similarity of a cluster, not the pairwise similarity.
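To make the distinction concrete, here is a minimal sketch using SciPy rather than dedupe itself. It does not reproduce dedupe's internals, and it works with distances rather than similarities. The three toy points and the choice of average linkage are assumptions made purely for illustration. The cophenetic distance between two items is the height at which they are first merged into the same cluster, which can differ from their direct pairwise distance.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist, squareform

# Three toy 1-D points (illustrative only).
points = np.array([[0.0], [1.0], [5.0]])

# Direct pairwise distances: d(0,1)=1, d(0,2)=5, d(1,2)=4.
pairwise = squareform(pdist(points))

# Build a hierarchy, then read off the cophenetic distances, i.e. the
# merge height at which each pair first ends up in the same cluster.
tree = linkage(pdist(points), method="average")
cophenetic = squareform(cophenet(tree))

print(pairwise)
print(cophenetic)
```

Points 0 and 2 are 5 apart pairwise, but they join the same cluster at height 4.5, so a threshold applied to the cophenetic value behaves differently from one applied to each pair.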

Related

Determine the optimal number of biclusters

I have recently performed K-means biclustering on a matrix of absolute correlation coefficient values. However, the biclustering algorithm requires the number of biclusters (k) to be defined as an input. Is there any good method to determine the optimal number of biclusters (k)?
I know that many people use a silhouette score to estimate the optimal number of clusters, but I have only heard of it being used with hierarchical clustering. Can the silhouette score also be applied to biclusters? Is there any other method to define an optimal number of biclusters? Could a mean squared residue score be used for this?
The biclustering algorithm generated biclusters along the diagonal such that a row or column will never belong to more than one bicluster.

How to select most important features? Feature Engineering

I used the function for Gower distance from this link: https://sourceforge.net/projects/gower-distance-4python/files/. My data (df) is such that each row is a trade and each column is a feature. Since it contains a lot of categorical data, I converted the data using Gower distance to measure "similarity". I hope this is correct (as below):
D = gower_distances(df)
distArray = ssd.squareform(D)
hierarchal_cluster=scipy.cluster.hierarchy.linkage(distArray, method='ward', metric='euclidean', optimal_ordering=False)
I then plot hierarchal_cluster from above as a dendrogram:
plt.title('Hierarchical Clustering Dendrogram (truncated)')
plt.xlabel('sample index or (cluster size)')
plt.ylabel('distance')
dendrogram(
    hierarchal_cluster,
    truncate_mode='lastp',  # show only the last p merged clusters
    p=15,  # show only the last p merged clusters
    leaf_rotation=90.,
    leaf_font_size=12.,
    show_contracted=True,  # to get a distribution impression in truncated branches
)
I cannot show it, since I do not have enough privilege points, but on the dendrogram I can see separate colors.
What is the main discriminator separating them?
How can I find this out?
How can I use PCA to extract useful features?
Do I pass my 'hierarchal_cluster' into a PCA function?
Something like the below..?
pca = PCA().fit(hierarchal_cluster.T)
plt.plot(np.arange(1,len(pca.explained_variance_ratio_)+1,1),pca.explained_variance_ratio_.cumsum())
Note that PCA works only for continuous data. Since you mention that there are many categorical features, it appears that you have mixed data.
A common practice when dealing with mixed data is to separate the continuous and categorical features/variables. Then find the Euclidean distance between data points for continuous (or numerical) features and Hamming distance for the categorical features [1].
This lets you measure similarity for the continuous and the categorical features separately. While you are at it, apply PCA to the continuous variables to extract important features, and apply Multiple Correspondence Analysis (MCA) to the categorical features. You can then combine the resulting features and apply any clustering algorithm.
So essentially, I'm suggesting feature selection/feature extraction before clustering.
[1] Huang, Z., 1998. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data mining and knowledge discovery, 2(3), pp.283-304.
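The split into continuous and categorical distances described in this answer could be sketched as follows. This is only an illustration under stated assumptions: the toy trade data, the column meanings and the integer encoding of the categories are all invented here, and the PCA and MCA steps are not shown.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy mixed data (hypothetical): two numeric and two categorical columns.
numeric = np.array([[1.0, 10.0], [2.0, 10.0], [1.0, 14.0]])
categorical = np.array([["buy", "USD"], ["buy", "EUR"], ["sell", "EUR"]])

# Encode each categorical column as integer codes so that pdist can
# compare values for equality.
codes = np.column_stack(
    [np.unique(column, return_inverse=True)[1] for column in categorical.T]
)

# Euclidean distance on the continuous part.
euclidean = squareform(pdist(numeric, metric="euclidean"))

# Hamming distance on the categorical part: the fraction of columns
# in which two rows disagree.
hamming = squareform(pdist(codes, metric="hamming"))

print(euclidean)
print(hamming)
```

In practice the numeric columns would usually be standardized first so that no single feature dominates the Euclidean part.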
Quoting the documentation of scipy on the matter of Ward linkage:
Methods ‘centroid’, ‘median’ and ‘ward’ are correctly defined only if Euclidean pairwise metric is used. If y is passed as precomputed pairwise distances, then it is a user responsibility to assure that these distances are in fact Euclidean, otherwise the produced result will be incorrect.
So you can't use Ward linkage with Gower!
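One possible way around this, not proposed in the answer itself, is to pick a linkage method that is not restricted to Euclidean distances. The sketch below swaps in average linkage on a precomputed distance matrix. The 4 x 4 matrix is a made-up stand-in for a Gower distance matrix, and the choice of 'average' is an assumption, not a recommendation from the SciPy documentation quoted above.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Made-up symmetric distance matrix standing in for gower_distances(df).
D = np.array([
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.8],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.8, 0.2, 0.0],
])

# linkage expects the condensed (upper-triangle) form of the matrix.
condensed = squareform(D, checks=True)

# Average linkage only uses the supplied distances, so it does not
# assume they are Euclidean.
tree = linkage(condensed, method="average")

# Cut the tree into at most two flat clusters.
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)
```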

Find top K cosine similar vectors to a given vector efficiently

The problem:
Suppose I have a group D of around 1,000,000 short documents (no more than 50 words each), and I want to let users supply a document from D and get the top K most similar documents from D.
My approach:
My first approach was to preprocess D by applying simple tf-idf and then, once I have an extremely sparse vector for each document, to use a simple nearest neighbours algorithm based on cosine similarity.
Then, at query time, I would just use my static nearest neighbours table, whose size is 1,000,000 x K, without any further calculations.
After applying tf-idf, I got vectors of size ~200,000, which means I now have a very sparse table (that can be stored efficiently in memory using sparse vectors) of size 1,000,000 x 200,000.
However, calculating the nearest neighbours model has taken more than one day and still hasn't finished.
I tried to lower the vector dimension by applying HashingTF, which uses the hashing trick, so that I can set the dimension to a constant (in my case, I used 2^13 for unified hashing), but I still get the same bad performance.
Some technical information:
I use Spark 2.0 for the tf-idf calculation, and sklearn's NearestNeighbors on the collected data.
Is there a more efficient way to achieve this goal?
Thanks in advance.
Edit:
I had the idea of trying an LSH-based approximate similarity algorithm like those implemented in Spark, as described here, but could not find one that supports the 'cosine' similarity metric.
The algorithm places some requirements on the relation between the number of training instances and the dimension of your vectors, but you can try DIMSUM.
You can find the paper here.
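For reference, an exact (not approximate) query-time lookup can be sketched with SciPy sparse matrices. This is neither DIMSUM nor LSH. It relies on the fact that, once rows are L2-normalized, cosine similarity reduces to a sparse dot product, so one query costs a single sparse matrix-vector product instead of a precomputed 1,000,000 x K table. The tiny matrix and the helper names l2_normalize_rows and top_k_cosine are hypothetical.

```python
import numpy as np
from scipy import sparse


def l2_normalize_rows(matrix):
    """Scale every row of a sparse matrix to unit Euclidean length."""
    norms = np.sqrt(np.asarray(matrix.multiply(matrix).sum(axis=1)).ravel())
    norms[norms == 0] = 1.0  # leave all-zero rows unchanged
    return (sparse.diags(1.0 / norms) @ matrix).tocsr()


def top_k_cosine(normalized, query_index, k):
    """Return indices of the k rows most cosine-similar to the query row."""
    scores = (normalized @ normalized[query_index].T).toarray().ravel()
    scores[query_index] = -np.inf  # exclude the query document itself
    top = np.argpartition(-scores, k)[:k]  # requires k < number of rows
    return top[np.argsort(-scores[top])]


# Tiny stand-in for a tf-idf matrix (4 documents, 3 terms).
docs = sparse.csr_matrix(np.array([
    [1.0, 1.0, 0.0],
    [2.0, 2.0, 0.0],
    [0.0, 0.0, 3.0],
    [1.0, 0.0, 1.0],
]))
normalized = l2_normalize_rows(docs)
neighbours = top_k_cosine(normalized, query_index=0, k=2)
print(neighbours)
```

Whether this is fast enough for 1,000,000 documents depends on the sparsity and the hardware; it is only meant to show that the full neighbour table does not have to be built up front.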

Interpreting clustering metrics

I'm doing clustering by k-means in Scikit-learn on 398 samples, 306 features. The features matrix is sparse, and the number of clusters is 4.
To improve the clustering, I tried two approaches:
After clustering, I used ExtraTreesClassifier() to classify and compute feature importances (samples labeled in clustering)
I used PCA to reduce the feature dimension to 2.
I have computed the following metrics (SS, CH, SH)
   Method               sum_of_squares  Calinski_Harabasz  Silhouette
1  kmeans               31.682          401.3              0.879
2  kmeans+top-features  5989230.351     75863584.45        0.977
3  kmeans+PCA           890.5431893     58479.00277        0.993
My questions are:
As far as I know, a smaller sum of squares means better clustering performance, and a Silhouette closer to 1 also means better clustering performance. However, in the last row both the sum of squares and the Silhouette are higher than in the first row.
How can I choose which approach has better performance?
Never compare sum-of-squares and similar metrics across different projections, transformations or data sets.
To see why, simply multiply every feature by 0.5: your SSQ will drop to 0.25 of its original value. So to "improve" your data set, you just need to scale it to a tiny size...
These metrics must only be used on the exact same input and parameters. You can't even use sum-of-squares to compare k-means with different k, because the larger k will win. All you can do is run multiple random attempts and keep the best minimum you found this way.
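The scaling argument is easy to check numerically. The sketch below uses random data and a fixed, arbitrary cluster assignment (both assumptions for illustration) and shows that halving every feature multiplies the within-cluster sum of squares by 0.25 without changing the clustering at all.

```python
import numpy as np


def within_cluster_ssq(points, labels):
    """Sum of squared distances of each point to its cluster centroid."""
    total = 0.0
    for label in np.unique(labels):
        members = points[labels == label]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return total


rng = np.random.default_rng(0)
points = rng.normal(size=(100, 5))     # random stand-in data
labels = rng.integers(0, 4, size=100)  # arbitrary fixed assignment

ssq_original = within_cluster_ssq(points, labels)
ssq_scaled = within_cluster_ssq(0.5 * points, labels)
ratio = ssq_scaled / ssq_original
print(ratio)
```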
With 306 features you are under the curse of dimensionality. Clustering in 306 dimensions is not meaningful. Therefore I wouldn't select features after clustering.
To get interpretable results, you need to reduce dimensionality. For 398 samples you need low dimension (2, 3, maybe 4). Your PCA with dimension 2 is good. You can try 3.
An approach with selecting important features before clustering may be problematic. Anyway, are 2/3/4 "best" features meaningful in your case?
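A minimal sketch of the "reduce first, then cluster" order, assuming scikit-learn and a random dense matrix with the same shape as in the question (the real data is sparse and would need to be densified or handled differently). The loop over 2 and 3 components mirrors the suggestion above. As the earlier answer notes, the resulting sums of squares should not be compared across the two projections.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
features = rng.random((398, 306))  # random stand-in with the question's shape

results = {}
for n_components in (2, 3):
    # Project to a low-dimensional space before clustering.
    reduced = PCA(n_components=n_components, random_state=0).fit_transform(features)
    # Several random restarts; scikit-learn keeps the run with the lowest inertia.
    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(reduced)
    results[n_components] = (reduced, labels)
```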

Text Documents Clustering - Non Uniform Clusters

I have been trying to cluster a set of text documents. I have a sparse TFIDF matrix with around 10k documents (a subset of a large dataset), and I run the scikit-learn k-means algorithm with different numbers of clusters (10, 50, 100). All other parameters are left at their default values.
I get very strange behavior: no matter how many clusters I specify, or even if I change the number of iterations, one cluster contains most of the documents and many clusters contain just 1 document. This is highly non-uniform behavior.
Does anyone know what kind of problem I am running into?
Here are the possible things that might be going "wrong":
Your k-means cluster initialization points are chosen as the same set of points in each run. I recommend using 'random' for the init parameter of k-means (http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html). If that doesn't work, supply your own set of random initial cluster centers to k-means. Remember to initialize your random generator through its seed() method with the current date and time; the random module (https://docs.python.org/2/library/random.html) uses the current date-time as the default value.
Your distance function, i.e. Euclidean distance, might be the culprit. This is less likely, but it is always good to run k-means with cosine similarity, especially when you are using it for document similarity. scikit-learn doesn't have this functionality at present, but you should look here: Is it possible to specify your own distance function using scikit-learn K-Means Clustering?
These two combined should give you good clusters.
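Both suggestions can be approximated with stock scikit-learn. This is a sketch under assumptions, not the method from the linked question: the feature matrix is random stand-in data, and cosine similarity is handled by L2-normalizing the rows first. For unit-length vectors a and b, ||a - b||^2 = 2 - 2 cos(a, b), so Euclidean k-means on normalized rows orders neighbours the same way cosine similarity does. A fixed random_state is used here only for reproducibility.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
features = rng.random((60, 20))  # stand-in for a TF-IDF matrix

# Scale every row to unit length so Euclidean distance tracks cosine similarity.
unit_rows = normalize(features, norm="l2")

# Numerical check of the identity ||a - b||^2 = 2 - 2 * cos(a, b).
a, b = unit_rows[0], unit_rows[1]
squared_distance = float(np.sum((a - b) ** 2))
cosine = float(a @ b)

# init="random" picks random observations as starting centers;
# n_init=10 repeats this with different random starts.
model = KMeans(n_clusters=5, init="random", n_init=10, random_state=0)
labels = model.fit_predict(unit_rows)
```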
With the help of the answers and comments above, I noticed that there was a problem with outliers and noise in the original space. For this, we should use a dimensionality reduction method that eliminates the unwanted noise in the data. I tried random projections first, but they did not work with the text data: the problem remained.
Then, using Truncated Singular Value Decomposition, I was able to get perfectly uniform clusters. Hence, Truncated SVD is the way to go with textual data, in my opinion.
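A minimal sketch of that pipeline with scikit-learn, assuming a handful of toy documents and two SVD components (real data would typically use on the order of a hundred components). The re-normalization step after the SVD is an added assumption and is not something this answer states.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Toy documents (illustrative only).
documents = [
    "the cat sat on the mat",
    "a cat chased the mouse",
    "dogs and cats are pets",
    "stock prices fell sharply today",
    "the market closed lower on trade fears",
    "investors sold shares as prices dropped",
]

# Sparse TF-IDF matrix, one row per document.
tfidf = TfidfVectorizer().fit_transform(documents)

# Truncated SVD accepts sparse input directly; the rows are
# re-normalized afterwards so that k-means sees unit-length vectors.
lsa = make_pipeline(
    TruncatedSVD(n_components=2, random_state=0),
    Normalizer(copy=False),
)
reduced = lsa.fit_transform(tfidf)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reduced)
print(labels)
```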
