SimHash-like algorithm to compare two text documents - string

The problem is:
I have a collection of text documents, and I want to pick the one most similar to an input document.
The input document could be an exact match or a partly modified copy.
The algorithm must be very fast.
So far, I have found SimHash, which takes a fingerprint of each document in the collection. Is there any other algorithm that does the same thing?

LSH (Locality Sensitive Hashing) techniques are general indexing methods. They are very efficient at finding approximate nearest neighbors.
SimHash is one hashing algorithm for LSH. It uses cosine similarity over real-valued data.
MinHash is another hashing algorithm for LSH. It estimates resemblance (Jaccard) similarity over binary vectors, that is, over sets.
Mining of Massive Datasets, Chapter 3, by Anand Rajaraman and Jeff Ullman, is a good introduction to the problem space and to MinHash in particular.
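As a rough illustration of the SimHash idea, here is a minimal sketch in plain Python. The helper names (`tokens`, `token_hash`, `simhash`, `hamming_distance`, `most_similar`), the choice of MD5 as the per-token hash, and the 64-bit fingerprint width are all assumptions made for this example, not the API of any particular library. Each token votes on every bit of the fingerprint, so documents that share most of their tokens tend to get fingerprints that differ in only a few bits.

```python
import hashlib
import re


def tokens(text):
    """Lowercase word tokens; a real system might use shingles instead."""
    return re.findall(r"\w+", text.lower())


def token_hash(token, bits=64):
    """Stable hash of a token, truncated to `bits` bits (MD5 is an arbitrary choice)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)


def simhash(text, bits=64):
    """Each token adds +1 or -1 to every bit position; the sign of the total gives the bit."""
    votes = [0] * bits
    for tok in tokens(text):
        h = token_hash(tok, bits)
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, vote in enumerate(votes):
        if vote > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def most_similar(query, fingerprints):
    """Return the key of the stored fingerprint closest to the query text."""
    q = simhash(query)
    return min(fingerprints, key=lambda name: hamming_distance(q, fingerprints[name]))


collection = {
    "doc1": "the quick brown fox jumps over the lazy dog",
    "doc2": "locality sensitive hashing finds approximate nearest neighbors",
}
# Fingerprints are computed once, ahead of time, for the whole collection.
fingerprints = {name: simhash(text) for name, text in collection.items()}
print(most_similar("the quick brown fox jumps over the lazy dog", fingerprints))  # doc1
```

The lookup here is a linear scan over the stored fingerprints, which is only meant to show the comparison step; for a large collection the fingerprints would normally be indexed so that only a few candidates are compared.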

Have you tried LSH (locality-sensitive hashing) techniques?
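To make the LSH suggestion more concrete, here is a small sketch of MinHash signatures combined with the banding trick, in plain Python. Everything named here (`shingles`, `seeded_hash`, `minhash_signature`, `band_keys`, `build_index`, `candidates`) is hypothetical, and the parameters (4-character shingles, 64 hash functions, 16 bands) are arbitrary assumptions for illustration. The point is that a query is only compared against documents that land in at least one common bucket, not against the whole collection.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=4):
    """Set of overlapping k-character substrings of the whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def seeded_hash(value, seed):
    """Stand-in for a family of hash functions: MD5 of the seed plus the value."""
    data = f"{seed}:{value}".encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def minhash_signature(text, num_hashes=64, k=4):
    """For each hash function, keep the minimum hash over all shingles."""
    items = shingles(text, k)
    return [min(seeded_hash(s, seed) for s in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of agreeing positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def band_keys(signature, bands=16):
    """Split the signature into bands; each band becomes one bucket key."""
    rows = len(signature) // bands
    return [(b, tuple(signature[b * rows:(b + 1) * rows])) for b in range(bands)]


def build_index(docs, bands=16):
    """Map every band key to the ids of the documents that produced it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for key in band_keys(minhash_signature(text), bands):
            index[key].add(doc_id)
    return index


def candidates(index, text, bands=16):
    """Documents sharing at least one band with the query; only these get compared."""
    found = set()
    for key in band_keys(minhash_signature(text), bands):
        found |= index.get(key, set())
    return found


docs = {
    "a": "simhash takes a fingerprint of every document in the collection",
    "b": "minhash estimates the resemblance between two sets of shingles",
}
index = build_index(docs)
print(candidates(index, docs["a"]))
```

More bands with fewer rows each make the index more forgiving of edits but return more candidates; that trade-off would need tuning on real data.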

Related

Jaccard vs. Cosine similarity for measuring distance between two words (fasttext)

These two distance measurements seem to be the most common in NLP from what I've read. I'm currently using cosine similarity (as does the gensim.fasttext distance measurement). Is there any case to be made for the use of Jaccard instead? Does it even work with only single words as input (with the use of ngrams I suppose)?
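import fasttext
import scipy.spatial.distance

ft = fasttext.load_model('cc.en.300.bin')
# Cosine distance, i.e. 1 - cosine similarity, between the two word vectors
distance = scipy.spatial.distance.cosine(ft['word1'], ft['word2'])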
I suppose I could imagine Jaccard similarity over bags-of-ngrams being useful for something. You could try some experiments to see if it correlates with good performance on some particular word-to-word task.
Maybe: typo correction? Or perhaps, when using a plain, non-Fasttext set-of-word-vectors, you might try synthesizing vectors for OOV words, by some weighted average of the most ngram-Jaccard-similar existing words? (In both cases: other simple comparisons, like edit-distance or shared-substring counting, might do better.)
But, I've not noticed projects using Jaccard-over-ngrams in lieu of whole-word-vector to whole-word-vector comparisons, nor libraries offering it as part of their interfaces/examples.
You've also only described its potential use very vaguely, "with the use of ngrams I suppose", with no code either demonstrating such calculation, or the results of such calculation being put to any use.
So potential usefulness seems like a research conjecture that you'd need to probe with your own experiments.
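For anyone who wants to run such an experiment, one way the "Jaccard over bags-of-ngrams" comparison could be computed for two single words is sketched below. The function names `char_ngrams` and `jaccard` and the choice of trigrams are assumptions for illustration; the `<` and `>` boundary markers imitate the way fastText pads a word before extracting character n-grams.

```python
def char_ngrams(word, n=3):
    """Set of character n-grams of the word, padded with boundary markers."""
    padded = f"<{word}>"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(word_a, word_b, n=3):
    """Size of the shared n-gram set divided by the size of the combined set."""
    a, b = char_ngrams(word_a, n), char_ngrams(word_b, n)
    if not a and not b:
        return 1.0  # two words too short to yield any n-gram count as identical
    return len(a & b) / len(a | b)


print(jaccard("night", "night"))            # 1.0
print(round(jaccard("night", "nacht"), 3))  # 0.111 (only "ht>" is shared, out of 9)
```

Note that this measures surface overlap only: two spelling variants score high, while two synonyms with different spellings score zero, which is the opposite of what word-vector cosine similarity is usually chosen for.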

Does adding a list of Word2Vec embeddings give a meaningful representation?

I'm using a pre-trained word2vec model (word2vec-google-news-300) to get the embeddings for a given list of words. Please note that this is NOT a list of words that we get after tokenizing a sentence, it is just a list of words that describe a given image.
Now I'd like to get a single vector representation for the entire list. Does adding all the individual word embeddings make sense? Or should I consider averaging?
Also, I would like the vector to be of a constant size so concatenating the embeddings is not an option.
It would be really helpful if someone can explain the intuition behind considering either one of the above approaches.
Averaging is most typical, when someone is looking for a super-simple way to turn a bag-of-words into a single fixed-length vector.
You could try a simple sum, as well.
But note that the key difference between the sum and average is that the average divides by the number of input vectors. Thus they both result in a vector that's pointing in the exact same 'direction', just of different magnitude. And, the most-often-used way of comparing such vectors, cosine-similarity, is oblivious to magnitudes. So for a lot of cosine-similarity-based ways of later comparing the vectors, sum-vs-average will give identical results.
On the other hand, if you're comparing the vectors in other ways, like via euclidean-distances, or feeding them into other classifiers, sum-vs-average could make a difference.
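The sum-vs-average point can be checked with a few lines of plain Python. The three-dimensional toy vectors below are made up for illustration (real word2vec vectors would have 300 dimensions), and the helper names are assumptions of this sketch.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def vector_sum(vectors):
    """Component-wise sum of a list of equal-length vectors."""
    return [sum(column) for column in zip(*vectors)]


def vector_mean(vectors):
    """Component-wise average: the sum divided by the number of vectors."""
    return [total / len(vectors) for total in vector_sum(vectors)]


# Toy stand-ins for the embeddings of three words describing one image.
word_vectors = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0], [2.0, 0.0, 1.0]]
other = [1.0, 1.0, 1.0]  # some other vector to compare against

summed = vector_sum(word_vectors)
averaged = vector_mean(word_vectors)

# Same direction, so cosine similarity to any other vector is the same.
print(math.isclose(cosine_similarity(summed, other), cosine_similarity(averaged, other)))
# Different magnitude, so Euclidean distance to the other vector differs.
print(euclidean_distance(summed, other), euclidean_distance(averaged, other))
```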
Similarly, some might try unit-length-normalizing all vectors before use in any comparisons. After such a pre-use normalization, then:
euclidean-distance (smallest to largest) & cosine-similarity (largest-to-smallest) will generate identical lists of nearest-neighbors
the average (or sum) of the normalized word-vectors will point in a different direction than the average (or sum) of the raw vectors - as the unit-normalization will have upped some vectors' magnitudes, and lowered others, changing their relative contributions.
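The first of these two points follows from the identity ||a - b||^2 = 2 - 2*cos(a, b), which holds for unit-length vectors: the smaller the distance, the larger the cosine. A small check in plain Python, using made-up two-dimensional vectors and hypothetical helper names:

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


query = [1.0, 0.0]
vectors = {"a": [1.0, 0.1], "b": [1.0, 1.0], "c": [0.0, 1.0], "d": [-1.0, 0.5]}

# Rank neighbors by cosine similarity, largest first.
by_cosine = sorted(vectors, key=lambda n: cosine_similarity(query, vectors[n]), reverse=True)
# Rank neighbors by Euclidean distance between unit-normalized vectors, smallest first.
by_euclidean = sorted(vectors, key=lambda n: euclidean_distance(unit(query), unit(vectors[n])))

print(by_cosine == by_euclidean)  # True
```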
What should you do? There's no universally right answer - depending on your dataset & goals, & the ways your downstream steps use the vectors, different choices might offer slight advantages in whatever final quality/desirability evaluation you perform. So it's common to try a few different permutations, along with varying other parameters.
Separately:
The GoogleNews vectors were trained on news articles back around 2013; their word senses thus may not be optimal for an image-labeling task. If you have enough of your own data, or can collect it, training your own word-vectors might result in better results. (Both the use of domain-specific data, & the ability to tune training parameters based on your own evaluations, could offer benefits - especially when your domain is unique, or the tokens aren't typical natural-language sentences.)
There are other ways to create a single summary vector for a run-of-tokens, not just arithmetical-combo-of-word-vectors. One that's a small variation on the word2vec algorithm often goes by the name Doc2Vec (or 'Paragraph Vector') - it may also be worth exploring.
There are also ways to compare bags-of-tokens, leveraging word-vectors, that don't collapse the bag-of-tokens to a single fixed-length vector first - and while they're more expensive to calculate, they sometimes offer better pairwise similarity/distance results than simple cosine-similarity. One such alternate comparison is called "Word Mover's Distance" - at some point, you may want to try that as well.

Degree of similarity

I have to compare two documents and find their degree of similarity.
All I need to do is compare two documents and give a number as a result. The number should depict the degree of similarity (similar documents will have a larger number).
I want an effective means to perform this process. (The similarity should not be measured only on the basis of shared words; the context must be taken into consideration too.)
Can anyone suggest an effective algorithm for this process?
Check out LSA (Latent Semantic Analysis). It compares two documents in a reduced "latent topic" space rather than by exact word overlap.
To use it, you will need to learn about the technique it is built on, SVD (Singular Value Decomposition).
If you want to implement the document clustering technique, you can try using Matlab and installing the Matlab-TMG tool.
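For readers who prefer Python to Matlab, the core of LSA can be sketched with NumPy: build a term-document count matrix, take a truncated SVD, and compare documents in the reduced space. The four toy documents, the function names, and the choice of k=2 latent dimensions are assumptions made purely for illustration; a real corpus would need many more documents and usually some term weighting first.

```python
import numpy as np

docs = [
    "the cat sat on the mat",
    "a cat lay on a rug",
    "stocks fell as markets closed",
    "markets rallied and stocks rose",
]


def term_document_matrix(docs):
    """Rows are vocabulary terms, columns are documents, entries are raw counts."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    row = {word: i for i, word in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for word in doc.split():
            matrix[row[word], j] += 1
    return matrix


def lsa_document_vectors(matrix, k=2):
    """Keep the k largest singular values; return one k-dimensional row per document."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return vt[:k].T * s[:k]


def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


matrix = term_document_matrix(docs)
doc_vectors = lsa_document_vectors(matrix, k=2)

# Similarity on raw word counts vs. similarity in the reduced LSA space.
print(cosine(matrix[:, 0], matrix[:, 1]))
print(cosine(doc_vectors[0], doc_vectors[1]))
print(cosine(doc_vectors[0], doc_vectors[2]))
```

In this toy corpus the two "cat" documents share only two words, so their raw-count cosine is low, yet they end up close together in the two-dimensional space and far from the "markets" documents.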
If you just want a quick, non-mathematical description, and an implementation (in Java), here's a link to an n-gram based solution.
Hint: for free text, use a shingle length of 4 or 5 (this is a parameter to the signature generation algorithm)

Finding related texts(correlation between two texts)

I'm trying to find similar articles in a database via correlation.
So I split each text into an array of words, delete frequently used words (articles, pronouns, and so on), and then compare two texts with a Pearson coefficient function. For some texts it works, but for others it's not so good (longer texts get a higher coefficient).
Can somebody advise a good method for finding related texts?
Some of the problems you mention boil down to normalizing over document length and overall word frequency. Try tf-idf.
First and foremost, you need to specify what you precisely mean by similarity and when two documents are (more/less) similar.
If the similarity you are looking for is literal, then I would vectorise the documents using term frequencies and compare them with cosine similarity, given that texts are inherently directional data. The tf-idf and log-entropy weighting schemes may be worth testing, depending on your use case. Edit distance is inefficient with long texts.
If you care more about the semantics, word embeddings are your ally.
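The literal-similarity route (term frequencies, tf-idf weighting, cosine similarity) can be sketched in plain Python as follows. The stop-word list, the helper names, and the particular smoothed idf formula are assumptions chosen for this example; libraries differ in the exact weighting they use. Dividing counts by document length and comparing with cosine is what removes the bias toward long texts described in the question.

```python
import math
import re
from collections import Counter

# A tiny, hypothetical stop-word list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "on", "is", "to"}


def tokenize(text):
    """Lowercase word tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]


def tfidf_vectors(docs):
    """One sparse {term: weight} dict per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append({
            term: (count / len(toks)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either one is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


short = "simhash fingerprint of text documents"
docs = [short, short + " " + short, "pearson correlation between word counts"]
vectors = tfidf_vectors(docs)

print(cosine(vectors[0], vectors[1]))  # the doubled text scores the same as the original
print(cosine(vectors[0], vectors[2]))  # no shared terms
```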

Speed up text comparisons (with sparse matrices)

I have a function which takes two strings and gives out the cosine similarity value which shows the relationship between both texts.
If I want to compare 75 texts with each other, I need to make 5,625 single comparisons to have all texts compared with each other.
Is there a way to reduce this number of comparisons? For example sparse matrices or k-means?
I don't want to talk about my function or about ways to compare texts. Just about reducing the number of comparisons.
What Ben says is true: to get better help, you need to tell us what the goal is.
For example, one possible optimization if you want to find similar strings is storing the string vectors in a spatial data structure such as a quadtree, where you can outright discard the vectors that are too far away from each other, avoiding many comparisons.
If your algorithm is pair-wise, then you probably can't reduce the number of comparisons, by definition.
You'll need to use a different algorithm, or at the very least pre-process your input if you want to reduce the number of comparisons.
Without the details of your function, it's difficult to give any concrete help.
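One reduction that needs no knowledge of the function itself: assuming the similarity is symmetric (cosine similarity is) and that comparing a text with itself is not needed, only unordered pairs have to be evaluated. For 75 texts that is 75 * 74 / 2 = 2,775 calls. A minimal sketch, where `pairwise_similarities` and the placeholder `toy_similarity` are hypothetical names standing in for the real code:

```python
from itertools import combinations


def pairwise_similarities(texts, similarity):
    """Evaluate `similarity` once per unordered pair of distinct texts."""
    return {
        (i, j): similarity(texts[i], texts[j])
        for i, j in combinations(range(len(texts)), 2)
    }


def toy_similarity(a, b):
    """Placeholder for the real function: Jaccard overlap of the word sets."""
    words_a, words_b = set(a.split()), set(b.split())
    return len(words_a & words_b) / len(words_a | words_b)


texts = [f"text number {i}" for i in range(75)]
scores = pairwise_similarities(texts, toy_similarity)
print(len(scores))  # 2775
```

Going below that count means not comparing every pair at all, which is where indexing or pre-processing approaches come in.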
