Clustering a long list of words - string

I have the following problem at hand: I have a very long list of words, possibly names, surnames, etc. I need to cluster this word list, such that similar words, for example words with similar edit (Levenshtein) distance appears in the same cluster. For example "algorithm" and "alogrithm" should have high chances to appear in the same cluster.
I am well aware of the classical unsupervised clustering methods like k-means clustering, EM clustering in the Pattern Recognition literature. The problem here is that these methods work on points which reside in a vector space. I have words of strings at my hand here. It seems that, the question of how to represent strings in a numerical vector space and to calculate "means" of string clusters is not sufficiently answered, according to my survey efforts until now. A naive approach to attack this problem would be to combine k-Means clustering with Levenshtein distance, but the question still remains "How to represent "means" of strings?". There is a weight called as TF-IDF weigt, but it seems that it is mostly related to the area of "text document" clustering, not for the clustering of single words. It seems that there are some special string clustering algorithms existing, like the one at http://pike.psu.edu/cleandb06/papers/CameraReady_120.pdf
My search in this area is going on still, but I wanted to get ideas from here as well. What would you recommend in this case, is anyone aware of any methods for this kind of problem?

Don't look for clustering. This is misleading. Most algorithms will (more or less forcefully) break your data into a predefined number of groups, no matter what. That k-means isn't the right type of algorithm for your problem should be rather obvious, isn't it?
This sounds very similar; the difference is the scale. A clustering algorithm will produce "macro" clusters, e.g. divide your data set into 10 clusters. What you probably want is that much of your data isn't clustered at all, but you want to want to merge near-duplicate strings, which may stem from errors, right?
Levenshtein distance with a threshold is probably what you need. You can try to accelerate this by using hashing techniques, for example.
Similarly, TF-IDF is the wrong tool. It's used for clustering texts, not strings. TF-IDF is the weight assigned to a single word (string; but it is assumed that this string does not contain spelling errors!) within a larger document. It doesn't work well on short documents, and it won't work at all on single-word strings.

I have encountered the same kind of problem. My approach was to create a graph where each string will be a node and each edge will connect two nodes with weight the similarity of those two strings. You can use edit distance or Sorensen for that. I also set a threshold of 0.2 so that my graph will not be complete thus very computationally heavy. After forming the graph you can use community detection algorithms to detect node communities. Each community is formed with nodes that have a lot of edges with each other, so they will be very similar with each other. You can use networkx or igraph to form the graph and identify each community. So each community will be a cluster of strings. I tested this approach with some string that I wanted to cluster. Here are some of the identified clusters.
University cluster
Council cluster
Committee cluster
I visualised the graph with the gephi tool.
Hope that helps even if it is quite late.

Related

Simple Binary Text Classification

I seek the most effective and simple way to classify 800k+ scholarly articles as either relevant (1) or irrelevant (0) in relation to a defined conceptual space (here: learning as it relates to work).
Data is: title & abstract (mean=1300 characters)
Any approaches may be used or even combined, including supervised machine learning and/or by establishing features that give rise to some threshold values for inclusion, among other.
Approaches could draw on the key terms that describe the conceptual space, though simple frequency count alone is too unreliable. Potential avenues might involve latent semantic analysis, n-grams, ..
Generating training data may be realistic for up to 1% of the corpus, though this already means manually coding 8,000 articles (1=relevant, 0=irrelevant), would that be enough?
Specific ideas and some brief reasoning are much appreciated so I can make an informed decision on how to proceed. Many thanks!
Several Ideas:
Run LDA and get document-topic and topic-word distributions say (20 topics depending on your dataset coverage of different topics). Assign the top r% of the documents with highest relevant topic as relevant and low nr% as non-relevant. Then train a classifier over those labelled documents.
Just use bag of words and retrieve top r nearest negihbours to your query (your conceptual space) as relevant and borrom nr percent as not relevant and train a classifier over them.
If you had the citations you could run label propagation over the network graph by labelling very few papers.
Don't forget to make the title words different from your abstract words by changing the title words to title_word1 so that any classifier can put more weights on them.
Cluster the articles into say 100 clusters and then choose then manually label those clusters. Choose 100 based on the coverage of different topics in your corpus. You can also use hierarchical clustering for this.
If it is the case that the number of relevant documents is way less than non-relevant ones, then the best way to go is to find the nearest neighbours to your conceptual space (e.g. using information retrieval implemented in Lucene). Then you can manually go down in your ranked results until you feel the documents are not relevant anymore.
Most of these methods are Bootstrapping or Weakly Supervised approaches for text classification, about which you can more literature.

Determine text similarity through cluster analysis

I am a senior bachelor student in CS and I currently work on my thesis. For this thesis I wrote a program that uses density-based clustering approach. More specifically, OPTICS algorithm. I have an idea of how to use it, but I don't know if it is valid.
I want to use this algorithm for text classification. Texts are points in the set that have to be clustered, so that the resulting hierarchy consists of categories and subcategories of texts. For example, one such set is "Scientific literature", consisting of subsets "Mathematics", "Biology" etc.
I came up with the idea that I can analyze texts for specific words that are encountered in particular text more often than in the whole dataset, also excluding insignificant words like prepositions. Perhaps I can use open source natural language parsers for that purpose, like Stanford parser. After that the program combines these "characteristic words" from each text into one set, and a certain amount of the most frequent words can be taken from this set. That amount becomes the dimentionality for the clustering, and each word's frequency in a particular text is used as a coordinate of a point. Thus we can cluster them.
The question is, is that idea valid or a complete nonsense? Can clustering in general and density-based clustering in particular be used for such classification? Maybe there is some kind of literature that can point me in the right direction?
Clustering != classification.
Run the clustering algorithm, and study the results. Most likely, there will not be a cluster "scientific literature" with subjects "mathematics" - what do you do then?
Also, clusters will only give you sets, that is too coarse for similarity search - on the contrary, you need first to solve the similarity problem, before you can run clustering algorithms such as OPTICS.
The "idea" you described is pretty much what everybody has been trying for years already.

Text Documents Clustering - Non Uniform Clusters

I have been trying to cluster a set of text documents. I have a sparse TFIDF matrix with around 10k documents (subset of a large dataset), and I try to run the scikit-learn k-means algorithm with different sizes of clusters (10,50,100). Rest all the parameters are default values.
I get a very strange behavior that no matter how many clusters I specify or even if I change the number of iterations, there would be 1 cluster in the lot which would contain most of the documents in itself and there will be many clusters which would have just 1 document in them. This is highly non-uniform behavior
Does anyone know what kind of problem am I running into?
Here are the possible things that might be going "wrong":
Your k-means cluster initialization points are chosen as the same set of points in each run. I recommend using the 'random' for the init parameter of k-means http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html. If that doesn't work then supply to k-means your own set of random initial cluster centers. Remember to initialize your random generator using its seed() method as the current date and time. https://docs.python.org/2/library/random.html uses current date-time as the default value.
Your distance function, i.e. euclidean distance might be the culprit. This is less likely but it is always good to run k-means using cosine similarity especially when you are using it for document similarity. scikits doesn't have this functionality at present but you should look here: Is it possible to specify your own distance function using scikit-learn K-Means Clustering?
These two combined should give you good clusters.
I noticed with the help of above answers and comments that there was a problem with outliers and noise in original space. For this, we should use a dimensionality reduction method which eliminates the unwanted noise in the data. I tried random projections first but it failed to work with text data, simply because the problem was still not solved.
Then using Truncated Singular Value Decomposition, I was able to get perfect uniform clusters. Hence, the Truncated SVD is the way to go with textual data in my opinion.

k-means with ellipsoids

I have n points in R^3 that I want to cover with k ellipsoids or cylinders (I don't really care; whichever is easier). I want to approximately minimize the union of the volumes. Let's say n is tens of thousands and k is a handful. Development time (i.e. simplicity) is more important than runtime.
Obviously I can run k-means and use perfect balls for my ellipsoids. Or I can run k-means, then use minimum enclosing ellipsoids per cluster rather than covering with balls, though in the worst case that's no better. I've seen talk of handling anisotropy with k-means but the links I saw seemed to think I had a tensor in hand; I don't, I just know the data will be a union of ellipsoids. Any suggestions?
[Edit: There's a couple votes for fitting a mixture of multivariate Gaussians, which seems like a viable thing to try. Firing up an EM code to do that won't minimize the volume of the union, but of course k-means doesn't minimize volume either.]
So you likely know k-means is NP-hard, and this problem is even more general (harder). Because you want to do ellipsoids it might make a lot of sense to fit a mixture of k multivariate gaussian distributions. You would probably want to try and find a maximum likelihood solution, which is a non-convex optimization, but at least it's easy to formulate and there is likely code available.
Other than that you're likely to have to write your own heuristic search algorithm from scratch, this is just a huge undertaking.
I did something similar with multi-variate gaussians using this method. The authors use kurtosis as the split measure, and I found it to be a satisfactory method for my application, clustering points obtained from a laser range finder (i.e. computer vision).
If the ellipsoids can overlap a lot,
then methods like k-means that try to assign points to single clusters
won't work very well.
Part of each ellipsoid has to fit the surface of your object,
but the rest may be inside it, don't-cares.
That is, covering algorithms
seem to me quite different from clustering / splitting algorithms;
unions are not splits.
Gaussian mixtures with lots of overlaps ?
No idea, but see the picture and code on Numerical Recipes p. 845.
Coverings are hard even in 2d, see
find-near-minimal-covering-set-of-discs-on-a-2-d-plane.

Incremental clusterig

please suggest some way for efficient incremental clustering. I am trying to put similar strings to one group. comparing with each other is not efficient. what i have thought is to check the each input string with the cluster representative( this means there is one representative pattern for strings in that cluster so that the new string can be compared to that only). So, anything to start with so that the nearly similar strings in a cluster can be represented by one universal pattern(may be) with highest possible accuracy. In this way the new input are just compared with cluster representative and the kept into it if found similar. The number of cluster and input are not fixed...strings are streaming and may be of any pattern length.
I hope i was clear. Just help me with some term to get going.
It sounds like the part of the problem that is giving you difficulty is finding a representative pattern to use for each cluster.
The usual way to do clustering of strings is to treat them as vectors and use cosine similarity as the distance measure: http://en.wikipedia.org/wiki/Cosine_distance
When the strings in the cluster are represented as vectors, then I think the center of the cluster is just the sum of the normalized vectors. Use this sum as the representative to compare each new string against.

Resources