How to efficiently compute similarity between documents in a stream of documents - node.js

I gather text documents (in Node.js), where each document i is represented as a list of words.
What is an efficient way to compute the similarity between these documents, taking into account that new documents keep arriving as a stream?
I currently use cosine similarity on the normalized word frequencies within each document. I don't use TF-IDF (term frequency - inverse document frequency) because of its scalability issue: the IDF weights would keep changing as I get more and more documents.
Initially
My first version was to start with the currently available documents, build a big term-document matrix A, and then compute S = A^T x A so that S(i, j) is (after normalization by both norm(doc(i)) and norm(doc(j))) the cosine similarity between documents i and j, whose word-frequency vectors are doc(i) and doc(j) respectively.
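For illustration, here is the same computation as a minimal NumPy sketch (my actual pipeline is Node.js/C++; the toy matrix and the names are just for illustration):
import numpy as np

# A: term-document matrix of shape (n_terms, n_docs); column j holds the word frequencies of doc(j)
A = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 3., 1.]])

# normalize each column to unit length, so that A^T A directly gives cosine similarities
A_hat = A / np.linalg.norm(A, axis=0)
S = A_hat.T @ A_hat          # S[i, j] = cos-similarity(doc(i), doc(j))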
For new documents
What do I do when I get a new document doc(k)? I have to compute the similarity of this document with all the previous ones, which doesn't require building the whole matrix again. I can just take the inner product doc(k) dot doc(j) for every previous j, and that gives S(k, j), which is great.
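That incremental step, under the same assumptions (the new document given as a frequency vector over the same vocabulary as A):
import numpy as np

def similarities_for_new_doc(doc_k, A_hat):
    """doc_k: raw word-frequency vector of the new document (length n_terms).
    A_hat: column-normalized term-document matrix of the previous documents.
    Returns the row S(k, :) of cosine similarities against all previous documents."""
    v = doc_k / np.linalg.norm(doc_k)   # normalize the new document
    return v @ A_hat                    # one matrix-vector product, no full rebuild of S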
The troubles
Computing S in Node.js takes really long. Way too long, in fact! So I decided to write a C++ module to do the whole thing much faster, and it does. But I can't just wait for it; I need to be able to use intermediate results. By "not wait for it" I mean both
a. not wait for the computation to be done, but also
b. not wait for the matrix A to be built (it's a big one).
Computing the new S(k, j) values can take advantage of the fact that each document has far fewer words than the full vocabulary (which is what I use to build the whole matrix A). So it looks faster to do that step directly in Node.js, avoiding the extra resources needed to move the data around.
But is there any better way to do that?
Note: the reason I started computing S is that I can easily build A in Node.js, where I have access to all the data, then do the matrix multiplication in C++ and get the result back in Node.js, which speeds the whole thing up a lot. But now that computing S has become impractical, that approach looks useless.
Note 2: yes, I don't have to compute the whole S, I can compute just the upper-right elements (or the lower-left ones), but that's not the issue; the computation-time problem is not of that order.

If one had to solve this today: just use pre-trained word vectors from fastText or word2vec.
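For instance, with gensim and pre-trained vectors you can embed each document as the mean of its word vectors and compare with cosine similarity (a rough sketch; the vector file name and the simple averaging scheme are assumptions):
import numpy as np
from gensim.models import KeyedVectors

# hypothetical path to pre-trained word2vec / fastText vectors in word2vec text format
wv = KeyedVectors.load_word2vec_format("vectors.vec", binary=False)

def doc_vector(words):
    """Average the vectors of the known words; zero vector if none are known."""
    known = [wv[w] for w in words if w in wv]
    return np.mean(known, axis=0) if known else np.zeros(wv.vector_size)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

sim = cosine(doc_vector(["fast", "similarity"]), doc_vector(["quick", "resemblance"]))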

Related

How to measure similarity between sentences inside a cluster after clustering?

I'm conducting a topic modeling analysis on messages from public Telegram groups. I'm super new to this area, so I'm just learning.
I've been following this example (https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6) and tried swapping out the HDBSCAN clustering algorithm for util.community_detection from the sentence-transformers (SBERT) documentation (https://www.sbert.net/docs/package_reference/util.html).
When I output the results of the clusters in this example (4899 Telegram messages), I get something that looks like this.
Topic: just a cluster label
Doc: all the messages in that cluster combined together
0: top keywords found via tf-idf
The problem I'm concerned with is that there are clearly a ton of messages that are basically identical to each other (I've marked them in yellow in my output). A few examples:
Cluster 3: this is just a bunch of "hellos" and variations thereof
Cluster 5: this is just a bunch of "Ok"s, people saying yes / ok
Cluster 7: people just saying thanks and variations on that
Cluster 9: some variations and misspellings of the word "gas"
Cluster 19: just "siap" which I think means "sorry if I already posted"
To a human reader, this type of text should probably just be excluded from the analysis altogether; the question is how to detect it.
Since the messages are already grouped together by the clustering algorithm, the algorithm must have some way of measuring the "similarity" between messages within a cluster. But I can't find these values exposed anywhere, or even what they are called. For example, I skimmed the HDBSCAN documentation (https://hdbscan.readthedocs.io/en/latest/basic_hdbscan.html#) a few times and didn't find any such property or measure exposed; am I missing something?
My hypothesis is that for clusters where it's just a word or a short phrase repeated over and over, this similarity value must be extremely high, and I could then just say "clusters whose internal similarity is higher than this threshold get thrown out".
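Something along these lines is what I have in mind (a rough sketch with sentence-transformers; the model name, the 0.9 threshold and the clusters variable are placeholders):
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")       # placeholder model name

def mean_pairwise_similarity(messages):
    """Average pairwise cosine similarity of the messages in one cluster."""
    emb = model.encode(messages, convert_to_tensor=True)
    sim = util.cos_sim(emb, emb)                      # (n, n) cosine similarity matrix
    n = len(messages)
    return ((sim.sum() - n) / (n * (n - 1))).item()   # exclude the diagonal of 1s

# clusters: list of message lists produced by the clustering step (assumed)
# near-duplicate clusters ("hello", "ok", ...) should score close to 1.0
filtered = [c for c in clusters if mean_pairwise_similarity(c) < 0.9]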
Any help & advice would be greatly appreciated, thanks!
Index the corpus of interest (e.g. with FAISS). Just to give an idea, example code is below:
import faiss
import numpy as np

def build_index(self):
    """Returns an inner-product FAISS index over the search documents."""
    vectors = [self.encode(document) for document in self.documents]
    # 768 = dimensionality of the sentence-embedding space
    index = faiss.IndexIDMap(faiss.IndexFlatIP(768))
    # Add document vectors as a float32 numpy matrix; IDs match positions in self.documents
    index.add_with_ids(np.array([vec.numpy() for vec in vectors], dtype="float32"),
                       np.arange(len(self.documents)))
    return index
Then apply any similarity metric, such as L2 (Euclidean) distance or cosine similarity via dot products. Essentially, the idea is that once documents are embedded as vectors in an n-dimensional space, vectors with similar semantics end up close together. Computing similarity then amounts to computing the angle between two vectors and taking its cosine: similar vectors form a smaller angle and therefore have a higher cosine value, and vice versa.
Check the following topics for your problem.
Cosine Similarity
FAISS
Sentence Vectors (similar to word vectors, but are good for long documents)
Check this repository for a better understanding of sentence vectorization and computing similarity to retrieve top n sentences.
In short,
Create an index file using FAISS for your data of interest.
Compute similarity by calling one of its methods.
Get top n most similar results.
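For example, querying the index built above could look like the following sketch (assuming the same class with its encode() method and documents list; k is arbitrary):
import numpy as np

def top_n(self, index, query_text, k=5):
    """Return the k most similar documents to query_text via inner-product search."""
    query_vec = np.array([self.encode(query_text).numpy()], dtype="float32")
    scores, ids = index.search(query_vec, k)          # both results have shape (1, k)
    return [(self.documents[i], float(s)) for i, s in zip(ids[0], scores[0])]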
Removing stop words:
Essentially, your problem can be attributed to a finite list of stop words. If you can identify them up to some finite number (say around 25 such distinct keywords at most), then the task becomes stop-word removal. Use the NLTK or spaCy libraries for easy stop-word removal, or specify the words yourself in a list of strings and drop any token that matches one of them from downstream processing. Stop-word removal is a standard and necessary pre-processing step in NLP. Your Telegram-data task is also similar to Twitter analysis; check this & this.
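A minimal sketch with NLTK (the extra domain-specific tokens in the set are just assumptions about your data):
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")                                  # one-time download
stop = set(stopwords.words("english"))
stop.update({"hello", "hi", "ok", "thanks", "siap"})        # assumed domain-specific noise tokens

def clean(tokens):
    """Drop stop words and noise tokens before downstream processing."""
    return [t for t in tokens if t.lower() not in stop]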

Find top K cosine similar vectors to a given vector efficiently

The problem:
Suppose I have a group D of around 1,000,000 short documents (no more than 50 words each), and I want to let users supply a document from the same group D and get back the top K most similar documents from D.
My approach:
My first approach was to preprocess the group D with simple tf-idf and, once I have an (extremely sparse) vector for each document, to use a simple nearest-neighbours algorithm based on cosine similarity.
Then, at query time, I would just use my static nearest-neighbours table, whose size is 1,000,000 x K, without any further calculations.
After applying tf-idf, I got vectors of dimension ~200,000, which means I now have a very sparse table (which can be stored efficiently in memory using sparse vectors) of size 1,000,000 x 200,000.
However, computing the nearest-neighbours model has taken more than a day and still hasn't finished.
I tried to lower the vector dimension by applying HashingTF, which uses the hashing trick, so that I could set the dimension to a constant (in my case I used 2^13 hash buckets), but I still get the same bad performance.
Some technical information:
I use Spark 2.0 for the tf-idf calculation, and scikit-learn's NearestNeighbors on the collected data.
Is there any more efficient way to achieve that goal?
Thanks in advance.
Edit:
I had an idea to try an LSH-based approximate similarity algorithm like those implemented in Spark, as described here, but could not find one that supports the 'cosine' similarity metric.
There are some requirements for the algorithm regarding the relation between the number of training instances and the dimensionality of your vectors, but you can try DIMSUM.
You can find the paper here.
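In Spark, DIMSUM is exposed through RowMatrix.columnSimilarities. A rough PySpark sketch (note that it computes similarities between columns, so documents would have to be laid out as columns here; sc is an existing SparkContext and the threshold value is arbitrary):
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix

# rows = terms, columns = documents, so columnSimilarities() yields
# document-to-document cosine similarities (toy dense data for illustration)
rows = sc.parallelize([
    Vectors.dense([2.0, 0.0, 1.0]),
    Vectors.dense([1.0, 1.0, 0.0]),
    Vectors.dense([0.0, 3.0, 1.0]),
])
mat = RowMatrix(rows)

# with a threshold > 0, DIMSUM sampling kicks in and low similarities may be approximated away
sims = mat.columnSimilarities(threshold=0.1)
for entry in sims.entries.take(10):        # CoordinateMatrix entries: (i, j, similarity)
    print(entry)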

Updatable nearest neighbor search

I'm trying to come up with a good design for a nearest neighbor search application. This would be somewhat similar to this question:
Saving and incrementally updating nearest-neighbor model in R
In my case this would be in Python, but the main point is that when new data comes in, the model / index must be updated. I'm currently playing around with the scikit-learn neighbors module, but I'm not convinced it's a good fit.
The goal of the application:
A user comes in with a query and then the n (probably fixed to 5) nearest neighbours in the existing data set are shown. For this step such a search structure from sklearn would help, but it would have to be regenerated whenever new records are added. Also, this is a first step that happens once per query and hence could be somewhat "slow", as in 2-3 seconds, compared to "instantly".
Then the user can click on one of the records and see that record's nearest neighbours, and so forth. This means we are now within the existing dataset, and the NNs could be precomputed and stored in Redis (for now 200k records, but this can grow to tens or hundreds of millions). This should be very fast to browse around.
But here I would face the same problem of how to update the precomputed data without having to do a full recomputation of the distance matrix, especially since there will be very few new records (around 100 per week).
Does such a tool, method or algorithm exist for updatable NN searching?
EDIT April 3rd:
As indicated in many places, KDTree or BallTree isn't really suited for high-dimensional data. I've realized that for a proof of concept with a small data set of 200k records and 512 dimensions, brute force isn't much slower at all: roughly 550 ms vs 750 ms.
However, for large data sets in the millions and beyond, the question remains unsolved. I've looked at the datasketch LSH Forest, but it seems that in my case it simply isn't accurate enough, or I'm using it wrong. I will ask a separate question about that.
You should look into FAISS and its IVFPQ method
What you can do there is create additional indexes for every update and merge them with the old one.
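A rough sketch of that idea with FAISS (the dimension, nlist and PQ parameters below are placeholders; one simple variant is to train once and keep adding new vectors to the same IVFPQ index):
import faiss
import numpy as np

d, nlist, m = 128, 100, 16                    # dimension, number of Voronoi cells, PQ sub-quantizers
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)    # 8 bits per sub-quantizer code

existing = np.random.rand(10000, d).astype("float32")  # stand-in for the current data set
index.train(existing)
index.add(existing)

# when the ~100 new records per week arrive, they can simply be added to the trained index
new_vecs = np.random.rand(100, d).astype("float32")
index.add(new_vecs)

D, I = index.search(new_vecs[:1], 5)          # 5 nearest neighbours of the first new vector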
You could try out Milvus, which supports adding vectors and near-real-time search.
Here are the benchmarks of Milvus.
nmslib supports adding new vectors. It's used by OpenSearch as part of their similarity search engine, and it's very fast.
One caveat:
While the HNSW algorithm allows incremental addition of points, it forbids deletion and modification of indexed points.
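For example, a minimal nmslib sketch (parameters are placeholders), just to show what building and querying such an HNSW index looks like:
import nmslib
import numpy as np

data = np.random.rand(1000, 512).astype("float32")     # stand-in for the 512-dimensional records

index = nmslib.init(method="hnsw", space="cosinesimil")
index.addDataPointBatch(data)
index.createIndex({"post": 2}, print_progress=False)

ids, dists = index.knnQuery(data[0], k=5)               # 5 nearest neighbours of one record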
You can also look into solutions like Milvus or Vearch.

SVD and singular / non-singular matrices

I need to use the SVD form of a matrix to extract concepts from a series of documents. My matrix is of the form A = [d1, d2, d3 ... dN] where di is a binary vector of M components. Then the svd decomposition gives me svd(A) = U x S x V' with S containing the singular values.
I use SVDLIBC to do the processing in Node.js (through a small module I wrote to call it). It seemed to work well, but I noticed something quite strange in the running-time behaviour depending on the state of my matrix (where N and M are growing, but both already above 1000).
At first I didn't think about removing duplicate document vectors, but after some tests it now looks like adding a document twice sometimes speeds the processing up extraordinarily.
Do I have to make sure that the columns of A are pairwise independent? Do they have to be linearly independent? (I thought not, since SVD seems to do its job well even with some columns being exactly the same; the resulting decomposition simply shows which columns / rows are redundant through 0 components in U or V.)
Now that computing the SVD of my big matrix sometimes takes far too long, I was trying to reduce its size by removing identical columns, but I found that adding duplicate columns can actually make it way faster. Is that normal? What's happening?
Logically, I'd say that I want my matrix to contain as much information as possible, and thus
[A] Remove all same columns, and in the best case, maybe
[B] Remove linearly dependent columns.
Doing [A] seems pretty simple and not too computationally expensive: I could hash my vectors at construction time to find candidate duplicates and then only check those, as in the sketch below. But are there good computational techniques for [A] and [B]?
(For [A], I'd like to avoid checking equality of a new vector against all past vectors the brute-force way, and for [B] I don't know any good way to check it or do it.)
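For [A], something like hashing each column once at construction time is what I have in mind, so a new vector only needs one dictionary lookup instead of a comparison against every past column (a rough sketch with dense NumPy columns):
import numpy as np

def deduplicate_columns(A):
    """Keep only the first occurrence of each distinct column of A."""
    seen = set()
    keep = []
    for j in range(A.shape[1]):
        key = A[:, j].tobytes()       # hashable key; exact equality is fine for binary columns
        if key not in seen:
            seen.add(key)
            keep.append(j)
    return A[:, keep]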
Added related question: regarding my second question, why would the SVD's running time change so massively just from adding one similar column? Is that a normal possible behaviour, or does it mean I should look for a bug in SVDLIBC?
It is difficult to say where the problem is without samples of fast and slow input matrices. But since one of the primary uses of the SVD is to provide a rotation that eliminates covariance, redundant (or identical) columns should not cause problems.
To answer your question about whether the slow behaviour is a bug in the library you're using, I'd suggest retrieving the SVD of the same matrix with another tool. For example, in Octave, compute the SVD of your matrix and compare runtimes:
[U, S, V] = svd(A)
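Or, as a quick cross-check from Python on the same matrix (the file name is hypothetical; this is a sanity check, not a benchmark harness):
import time
import numpy as np

A = np.loadtxt("matrix.txt")                       # hypothetical dump of the same matrix
t0 = time.time()
U, s, Vt = np.linalg.svd(A, full_matrices=False)   # thin SVD, same singular values
print("numpy SVD took %.2f s" % (time.time() - t0))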

Fast(er) way of matching feature to database

I'm working on a project where I have a feature in an image described as a set of X & Y coordinates (5-10 points per feature) which are unique to this feature. I also have a database with thousands of features, where each has the same type of descriptor. The result looks like this:
myFeature: (x1,y1), (x2,y2), (x3,y3)...
myDatabase: Feature1: (x1,y1), (x2,y2), (x3,y3)...
Feature2: (x1,y1), (x2,y2), (x3,y3)...
Feature3: (x1,y1), (x2,y2), (x3,y3)...
...
I want to find the best match of myFeature in the features in myDatabase.
What is the fastest way to match these features? Currently I am stepping though each feature in the database and comparing each individual point:
import math

best_score = 0
best_feature = None
for feature in myDatabase:
    score = 0
    for (px, py) in myFeature:
        # minimum distance from the current point to the points describing the database feature
        min_dist = min(math.hypot(px - qx, py - qy) for (qx, qy) in feature)
        if min_dist < threshold:
            # there is a match to the current point in the target feature
            score += 1
    if score > best_score:
        best_score = score
        best_feature = feature          # save feature as new best match
This search works, but it clearly gets painfully slow on large databases. Does anyone know of a faster method for this type of search, or at least a way to quickly rule out features that clearly won't match the descriptor?
Create a bitset (an array of 1s and 0s) from each feature.
Create such a bitmask for your search criteria and then just use a bitwise and to compare the search mask to your features.
With this approach, you can shift most of the work to the routines responsible for saving the features. Also, creating the bitmasks should not be that computationally intensive.
If you just want to rule out features that absolutely can't match, then your mask-creation algorithm should take care of that and create the bitmasks a bit fuzzy.
The easiest way to create such masks is probably by creating a matrix as big as the matrix of your features and put a one in every coordinate that is set for the feature and a zero in every coordinate that isn't. Then turn that matrix into a one dimensional row. Compare the feature-row then to the search mask bitwise.
This is similar to the way bitmap indexes work on large databases (oracle e.g.), but with a different intention and without a full bitmap-image of all database rows in memory.
The power of this is in the bitwise comparisons.
On a 32-bit machine you can perform 32 comparisons per instruction, whereas a point comparison with integer numbers handles only one at a time. The gain is even bigger compared with floating-point operations, depending on the architecture.
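A rough sketch of the idea in Python (the grid resolution is an assumption, and it assumes the coordinates are already quantized to integer grid cells):
GRID_W, GRID_H = 64, 64                          # assumed resolution of the coordinate grid

def to_mask(points):
    """Pack a feature's (x, y) points into one big-integer bitmask, one bit per grid cell."""
    mask = 0
    for x, y in points:
        mask |= 1 << (y * GRID_W + x)
    return mask

def overlap(mask_a, mask_b):
    """Number of grid cells set in both masks (bitwise AND, then popcount)."""
    return bin(mask_a & mask_b).count("1")

db_masks = [to_mask(f) for f in myDatabase]      # precompute once when saving the features
query_mask = to_mask(myFeature)
best = max(range(len(db_masks)), key=lambda i: overlap(query_mask, db_masks[i]))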
This in general looks like a spatial indexing problem. It's not my field, but you'll probably need to build some sort of tree index, such as a quadtree, that you can use to search for features efficiently. You can find some links in this Wikipedia article: http://en.wikipedia.org/wiki/Spatial_index
It might be a problem that you can easily implement in an existing spatial database. It's very GIS-like in its description.
One thing you can do is calculate a point of gravity (centroid) for every feature and use that to whittle down the search space a bit (a lower-dimensional search is a lot easier to build an index for), but that has the downside of being just a heuristic: depending on the shapes of your features, the centroid may end up in weird places.
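A sketch of that centroid pre-filter (using scipy's cKDTree; the candidate count of 50 is arbitrary):
import numpy as np
from scipy.spatial import cKDTree

# one centroid ("point of gravity") per database feature
centroids = np.array([np.mean(f, axis=0) for f in myDatabase])
tree = cKDTree(centroids)

# keep only the 50 features whose centroids are closest to the query's centroid
query_centroid = np.mean(myFeature, axis=0)
_, candidate_ids = tree.query(query_centroid, k=50)
# ...then run the exact point-by-point scoring on these candidates only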
