How can i cluster document using k-means (Flann with python)? - nlp

I want to cluster documents based on similarity.
I haved tried ssdeep (similarity hashing), very fast but i was told that k-means is faster and flann is fastest of all implementations, and more accurate so i am trying flann with python bindings but i can't find any example how to do it on text (it only support array of numbers).
I am very very new to this field (k-means, natural language processing). What i need is speed and accuracy.
My questions are:
Can we do document similarity grouping / Clustering using KMeans (Flann do not allow any text input it seems )
Is Flann the right choice? If not please suggest me High performance library that support text/docs clustering, that have python wrapper/API.
Is k-means the right algorithm?

You need to represent your document as an array of numbers (aka, a vector). There are many ways to do this, depending on how sophisticated you want to be, but the simplest way is just to represent is as a vector of word counts.
So here's what you do:
Count up the number of times each word appears in the document.
Choose a set of "feature" words that will be included in your vector. This should exclude extremely common words (aka "stopwords") like "the", "a", etc.
Make a vector for each document based on the counts of the feature words.
Here's an example.
If your "documents" are single sentences, and they look like (one doc per line):
there is a dog who chased a cat
someone ate pizza for lunch
the dog and a cat walk down the street toward another dog
If my set of feature words are [dog, cat, street, pizza, lunch], then I can convert each document into a vector:
[1, 1, 0, 0, 0] // dog 1 time, cat 1 time
[0, 0, 0, 1, 1] // pizza 1 time, lunch 1 time
[2, 1, 1, 0, 0] // dog 2 times, cat 1 time, street 1 time
You can use these vectors in your k-means algorithm and it will hopefully group the first and third sentence together because they are similar, and make the second sentence a separate cluster since it is very different.

There is one big problem here:
K-means is designed for Euclidean distance.
The key problem is the mean function. The mean will reduce variance for Euclidean distance, but it might not do so for a different distance function. So in the worst case, k-means will no longer converge, but run in an infinite loop (although most implementations support stopping at a maximum number of iterations).
Furthermore, the mean is not very sensible for sparse data, and text vectors tend to be very sparse. Roughly speaking the problem is that the mean of a large number of documents will no longer look like a real document, and this way become dissimilar to any real document, and more similar to other mean vectors. So the results to some extend degenerate.
For text vectors, you probably will want to use a different distance function such as cosine similarity.
And of course you first need to compute number vectors. For example by using relative term frequencies, normalizing them via TF-IDF.
There is a variation of the k-means idea known as k-medoids. It can work with arbitrary distance functions, and it avoids the whole "mean" thing by using the real document that is most central to the cluster (the "medoid"). But the known algorithms for this are much slower than k-means.

Related

How to measure similarity between sentences inside a cluster after clustering?

I'm conducting topic modeling analysis on messages from public Telegram groups, super new to this area so just learning.
I've been following this example here (https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6), and tried swapping out the HDBSCAN clustering algorithm with the one in BERT's documentation util.community_detection (https://www.sbert.net/docs/package_reference/util.html).
When I output the results of the clusters in this example (4899 Telegram messages), I get something that looks like this.
Topic: just a cluster label
Doc: all the messages in that cluster combined together
0: top keywords found via tf-idf
The problem I'm concerned with is that, there are clearly a ton of messages that are basically identical to each other, I've marked them in yellow. A few examples,
Cluster 3: this is just a bunch of "hellos" and variations thereof
Cluster 5: this is just a bunch of "Ok"s, people saying yes / ok
Cluster 7: people just saying thanks and variations on that
Cluster 9: some variations and misspellings of the word "gas"
Cluster 19: just "siap" which I think means "sorry if I already posted"
To a human reader I feel like this type of text should just be excluded from the analysis altogether, the question is how do I detect it.
Since they're already grouped together by the clustering algorithm, the algorithm must have ways to measure the "similarity" between these messages within a cluster. But I don't seem to be able to find these values exposed anywhere or what it's called. Like for example the HDBSCAN algorithm (https://hdbscan.readthedocs.io/en/latest/basic_hdbscan.html#), I skimmed through the doc a few times and didn't find any such property or measure exposed, am I missing something here?
My hypothesis is that for the cases where it's just a word or a short phrase repeated over and over again, this similarity value must be super super high, and I'd just say "clusters whose internal similarity is higher than this threshold are getting thrown out".
Any help & advice would be greatly appreciated, thanks!
Index the corpus of your interest (for e.g. FAISS) just for an idea, example code is below:
def build_index(self):
""":returns an inverted index for the search documents"""
vectors = [self.encode(document) for document in self.documents]
index = faiss.IndexIDMap(faiss.IndexFlatIP(768)) # dimensionality of vector space
# Add document vectors into index after transforming into numpy arrays. IDs should match len(documents)
index.add_with_ids(np.array([vec.numpy() for vec in vectors]), np.array(range(0, len(self.documents))))
return index
Then perform any similarity metric like L2 Euclidean distance or cosine similarity with dot products. Essentially, concept is that once we transform vectors in an n-dimensional space, vectors with similar semantics are grouped together. Therefore, computing similarity is just computing the angle between them and applying a cosine on it. Similar vectors have less angle, therefore higher cosine value & vice-versa.
Check the following topics for your problem.
Cosine Similarity
FAISS
Sentence Vectors (similar to word vectors, but are good for long documents)
Check this repository for a better understanding of sentence vectorization and computing similarity to retrieve top n sentences.
In short,
Create an index file using FAISS for your data of interest.
Compute similarity by calling one of its methods.
Get top n most similar results.
Removing stop words:
Essentially your problem can be attributed to a list of finite stop words. If you can identify ones to some finite value (e.g. some 25) such different key words at max, then the task becomes stop word removal. Please use NLTK / Spacy libraries for easy stop word removal. You can also specify them in a list of strings, write a condition where if a token matches with one of those strings, they’re deleted from downstream processing. Stop words are omitted & is a necessary pre-processing task in NLP. Your task of telegram
data is also similar to Twitter analysis. Check this & this.

Find top K cosine similar vectors to a given vector efficiently

The problem:
Suppose I have a group of around 1,000,000 short documents D (no more than 50 words each), and I want to let users to supply a document from the same group D, and and get the top K similar documents from D.
My approach:
My first approach was to preprocess the group D by applying simple tf-idf, and after I have vector for each document, which is extremely sparse, to use a simple nearest neighbours algorithm based on cosine similarity.
Then, on query time, to justuse my static nearest neighbours table which its size is 1,000,000 x K, without any further calculations.
After applying tf-idf, I got vectors in size ~200,000, which means now I have a very sparse table (that can be stored efficiently in memory using sparse vectors) in size 1,000,000 x 200,000.
However, calculating the nearest neighbours model took me more than one day, and still haven't finished.
I tried to lower the vectors dimension by applying HashingTF, that utilizes the hasing trick, instead, so I can set the dimension to a constant one (in my case, i used 2^13 for uninfied hashing), but still I get the same bad performance.
Some technical information:
I use Spark 2.0 for the tf-idf calculation, and sklearn NearestNeighbours on the collected data.
Is thier any more efficient way to achieve that goal?
Thanks in advance.
Edit:
I had an idea to try a LSH based approximation similarity algorithm like those implemented in spark as described here, but could not find one that supports the 'cosine' similarity metric.
There were some requirements for the algorithm on the relation between training instances and the dimensions of your vectors , but you can try DIMSUM.
You can find the paper here.

Are the features of Word2Vec independent each other?

I am new to NLP and studying Word2Vec. So I am not fully understanding the concept of Word2Vec.
Are the features of Word2Vec independent each other?
For example, suppose there is a 100-dimensional word2vec. Then the 100 features are independent each other? In other words, if the "sequence" of the features are shuffled, then the meaning of word2vec is changed?
Word2vec is a 'dense' embedding: the individual dimensions generally aren't independently interpretable. It's just the 'neighborhoods' and 'directions' (not limited to the 100 orthogonal axis dimensions) that have useful meanings.
So, they're not 'independent' of each other in a statistical sense. But, you can discard any of the dimensions – for example, the last 50 dimensions of all your 100-dimensional vectors – and you still have usable word-vectors. So in that sense they're still independently useful.
If you shuffled the order-of-dimensions, the same way for every vector in your set, you've then essentially just rotated/reflected all the vectors similarly. They'll all have different coordinates, but their relative distances will be the same, and if "going toward word B from word A" used to vaguely indicate some human-understandable aspect like "largeness", then even after performing your order-of-dimensions shuffle, "going towards word B from word A" will mean the same thing, because the vectors "thataway" (in the transformed coordinates) will be the same as before.
The first thing to understand here is that how word2Vec is formalized. Shifting away from traditional representations of words, the word2vec model tries to encode the meaning of the world into different features. For eg lets say every word in the english dictionary can be manifested in a set of say '4' features. The features could be , lets say "f1":"gender", "f2":"color","f3":"smell","f4":"economy".
So now when a word2vec vector is written , what it signifies is how much manifestation of a particular feature it has. Lets take an example to understand this. Consider a Man(V1) who is dark,not so smelly and is not very rich and is neither poor. Then the first feature ie gender is represented as 1 (since we are taking 1 as male and -1 as female). The second feature color is -1 here as it is exactly opposite to white (which we are taking as 1). Smell and economy are similary given 0.3 and 0.4 values.
Now consider another man(V2) who also has the same anatomy and social status like the first man. Then his word2vec vector would also be similar.
V1=>[1,-1,0.3,0.4]
V2=>[1,-1,0.4,0.3]
This kind of representation helps us represent words into features that are independent or orthogonal to each other.The orthogonality helps in finding similarity or dissimilarity based on some mathematical operation lets say cosine dot product.
The sequence of the number in a word2vec is important since every number represents the weight of a particular feature: gender, color,smell,economy. So shuffling the positions would result in a completely different vector

Information Retrieval: How to combine different word results when using tf-idf?

Let's say I have a user search query which looks like:
"the happy bunny"
I have already computed tf-idf and have something like this (following are made up example values) for each document in which I am searching (of coures the idf is always the same):
tf idf score
the 0.06 1 0.06 * 1 = 0.06
happy 0.002 20 0.002 * 20 = 0.04
bunny 0.0005 60 0.0005 * 60 = 0.03
I have two questions with what to do next.
Firstly, the still has the highest score, even though it is adjusted for rarity by idf, still it's not exactly important - do you think I should square the idf values to weight in terms of rare words, or would this give bad results? Otherwise I'm worried that the is getting equal importance to happy and bunny, and it should be obvious that bunny is the most important word in the search. As long as rare always equals important then it would be always a good idea to weight in terms of rarity, but if that is not always the case then doing so could really mess up the results.
Secondly and more importantly: what is the best/preferred method for combining the scores for each word together to give each document a single score that represents how well it reflects the entire search query? I was thinking of adding them, but it has become apparent that that is going to give higher priority to a document containing 10,000 happy but only 1 bunny instead of another document with 500 happy and 500 bunny (which would be a better match).
First, make sure that you are computing the correct TF-IDF values. As others have pointed they do not look right. TF is relative to specific documents, and we often do not need to compute them for queries (since raw term frequency is almost always 1 in queries). There are different types of TF functions to pick from (check the Wikipedia page on tf-idf, it has a good coverage). Log Normalisation is common and the most efficient scheme, since it saves an extra disk access to get the respective document's total frequency maxF that is needed for something like Double Normalisation. When you are dealing with large volumes of documents this can be expensive, especially if you can't bring these into memory. A bit of insight on inverted files can go a long way in understanding some of the underlying complexities. Log normalisation is efficient and is a non-linear function, therefore better than raw frequency.
Once you are certain on your weighting scheme, then you may want to consider a stop list to get rid of very common/noisy words. These do not contribute to the rank of documents. It is generally recommended to use a stop list of high frequency, very common words. Do a search and you will find many available, including the one that Lucene uses.
The remaining lies on your ranking strategy and that will depend on your implementation/model. The vector space model (VSM) is simple and readily available with libraries like Lucene, Lemur, etc. VSM computes the Dot product or scalar of the weights of common terms between the query and a document. Term weights are normalised via vector length normalisation (which solves your second question), and the result of applying the model is a value between 0 and 1. This is also justified/interpreted as the Cosine of the angle between two vectors in a planar graph, or the Euclidean distance divided by the Euclidean vector length of two vectors.
One of the earliest comprehensive studies on weighting schemes and ranking with VSM is an article by Salton (pdf) and is a good read if you are interested in Information Retrieval. A bit outdated perhaps (notice how log normalisation is not mentioned in the article).
Your best read I believe is the book Introduction to Information Retrieval by Christopher Manning. It will take you through everything that you need to know, from indexing to ranking schemes, etc. A bit lacking on ranking models (does not cover some of the more complex probabilistic approaches).
You should reconsider your TF and IDF values, they do not look correct. The TF value is usually just how often the word occurs, so if the word "the" appeared 20 times it's tf value would be 20. A word like "the" should have a very low IDF value (possibly around 4 decimal places, 0.000...).
You could use stop word removal if word like the are not necessary, they would be removed rather than just given a low score.
A vector space model could be used for this.
can you compute tf-idf for amalgamated terms? That is, you first generate a sentiment that considers each of its component as equal before treating the sentiment as a single term for which you now compute the tf-idf

Euclidean vs Cosine for text data

IF I use tf-idf feature representation (or just document length normalization), then is euclidean distance and (1 - cosine similarity) basically the same? All text books I have read and other forums, discussions say cosine similarity works better for text...
I wrote some basic code to test this and found indeed they are comparable, not exactly same floating point value but it looks like a scaled version. Given below are the results of both the similarities on simple demo text data. text no.2 is a big line of about 50 words, rest are small 10 word lines.
Cosine similarity:
0.0, 0.2967, 0.203, 0.2058
Euclidean distance:
0.0, 0.285, 0.2407, 0.2421
Note: If this question is more suitable to Cross Validation or Data Science, please let me know.
If your data is normalized to unit length, then it is very easy to prove that
Euclidean(A,B) = 2 - Cos(A,B)
This does hold if ||A||=||B||=1. It does not hold in the general case, and it depends on the exact order in which you perform your normalization steps. I.e. if you first normalize your document to unit length, next perform IDF weighting, then it will not hold...
Unfortunately, people use all kinds of variants, including quite different versions of IDF normalization.

Resources