Degree of similarity - text

I have to compare two documents and find their degree of similarity.
All I need to do is compare two documents and produce a number as a result. The number should represent the degree of similarity (similar documents get a larger number).
I want an effective way to do this. (Similarity should not be measured only on the basis of shared words; the context must be taken into consideration too.)
Can anyone suggest an effective algorithm for this?

Check out LSA (Latent Semantic Analysis). It is designed for exactly this kind of document similarity check.
For this you will need to learn about the technique called SVD (Singular Value Decomposition).
If you want to implement the document clustering technique, you can try using Matlab and install the Matlab-TMG tool.
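If you prefer Python, here is a rough LSA sketch (just an illustration, assuming scikit-learn; the two sample documents and the component count are placeholders):

# Rough LSA sketch: TF-IDF term-document matrix, truncated SVD, cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = ["The first document talks about cats and dogs.",
        "The second document talks about dogs and wolves."]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)              # documents x terms, TF-IDF weighted

svd = TruncatedSVD(n_components=2)         # use ~100-300 components for a real corpus
Z = svd.fit_transform(X)                   # low-dimensional "semantic" document vectors

print(cosine_similarity(Z[0:1], Z[1:2])[0, 0])   # larger number = more similar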

If you just want a quick, non-mathematical description, and an implementation (in Java), here's a link to an n-gram based solution.
Hint: for free text, use a shingle length of 4 or 5 (this is a parameter to the signature generation algorithm).
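The linked implementation isn't reproduced here, but the underlying idea is roughly the following (a sketch, not the library's actual API): split each document into overlapping k-word shingles and compare the shingle sets, for example with Jaccard similarity.

# Sketch of shingle-based comparison; doc1/doc2 are placeholder texts.
def shingles(text, k=4):
    # Set of overlapping k-word shingles of a text.
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 0))}

def jaccard(a, b):
    # Jaccard similarity of two sets (1.0 = identical, 0.0 = disjoint).
    return len(a & b) / len(a | b) if (a | b) else 1.0

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox jumped over a lazy dog"
print(jaccard(shingles(doc1), shingles(doc2)))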

Related

Jaccard vs. Cosine similarity for measuring distance between two words (fasttext)

These two distance measurements seem to be the most common in NLP from what I've read. I'm currently using cosine similarity (as does the gensim.fasttext distance measurement). Is there any case to be made for the use of Jaccard instead? Does it even work with only single words as input (with the use of ngrams I suppose)?
import fasttext
import scipy.spatial.distance
ft = fasttext.load_model('cc.en.300.bin')   # pretrained 300-d English fastText vectors
distance = scipy.spatial.distance.cosine(ft['word1'], ft['word2'])
I suppose I could imagine Jaccard similarity over bags-of-ngrams being useful for something. You could try some experiments to see if it correlates with good performance on some particular word-to-word task.
Maybe: typo correction? Or perhaps, when using a plain, non-Fasttext set-of-word-vectors, you might try synthesizing vectors for OOV words, by some weighted average of the most ngram-Jaccard-similar existing words? (In both cases: other simple comparisons, like edit-distance or shared-substring counting, might do better.)
But, I've not noticed projects using Jaccard-over-ngrams in lieu of whole-word-vector to whole-word-vector comparisons, nor libraries offering it as part of their interfaces/examples.
You've also only described its potential use very vaguely, "with the use of ngrams I suppose", with no code either demonstrating such calculation, or the results of such calculation being put to any use.
So potential usefulness seems like a research conjecture that you'd need to probe with your own experiments.
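If you do want to run such an experiment, a minimal character-n-gram Jaccard comparison between two words (just a sketch; this is not something gensim or fasttext provide) is easy to write:

# Sketch: Jaccard similarity over character n-grams of two single words.
def char_ngrams(word, n=3):
    padded = f"<{word}>"                     # boundary markers, similar to fastText's
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def ngram_jaccard(w1, w2, n=3):
    a, b = char_ngrams(w1, n), char_ngrams(w2, n)
    return len(a & b) / len(a | b) if (a | b) else 1.0

print(ngram_jaccard("similarity", "similarly"))   # higher for near-identical spellings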

Determine text similarity through cluster analysis

I am a senior bachelor student in CS and I currently work on my thesis. For this thesis I wrote a program that uses a density-based clustering approach, more specifically the OPTICS algorithm. I have an idea of how to use it, but I don't know if it is valid.
I want to use this algorithm for text classification. Texts are points in the set that have to be clustered, so that the resulting hierarchy consists of categories and subcategories of texts. For example, one such set is "Scientific literature", consisting of subsets "Mathematics", "Biology" etc.
I came up with the idea that I can analyze texts for specific words that occur in a particular text more often than in the whole dataset, also excluding insignificant words like prepositions. Perhaps I can use open-source natural language parsers for that purpose, like the Stanford parser. After that the program combines these "characteristic words" from each text into one set, and a certain number of the most frequent words can be taken from this set. That number becomes the dimensionality for the clustering, and each word's frequency in a particular text is used as a coordinate of a point. Thus we can cluster them.
The question is, is that idea valid or a complete nonsense? Can clustering in general and density-based clustering in particular be used for such classification? Maybe there is some kind of literature that can point me in the right direction?
Clustering != classification.
Run the clustering algorithm, and study the results. Most likely, there will not be a cluster "scientific literature" with subjects "mathematics" - what do you do then?
Also, clusters will only give you sets, which is too coarse for similarity search; in fact, you first need to solve the similarity problem before you can run clustering algorithms such as OPTICS.
The "idea" you described is pretty much what everybody has been trying for years already.

Applied NLP: how to score a document against a lexicon of multi-word terms?

This is probably a fairly basic NLP question but I have the following task at hand: I have a collection of text documents that I need to score against an (English) lexicon of terms that could be 1, 2, 3, ..., up to N words long. N is bounded by some "reasonable" number, but the distribution of terms in the dictionary across the values n = 1, ..., N might be fairly uniform. This lexicon can, for example, contain a list of devices of a certain type, and I want to see if a given document is likely about any of these devices. So I would want to score a document higher if it has one or more occurrences of any of the lexicon entries.
What is a standard NLP technique to do the scoring while accounting for various forms of the words that may appear in the lexicon? What sort of preprocessing would be required for both the input documents and the lexicon to be able to perform the scoring? What sort of open-source tools exist for both the preprocessing and the scoring?
I studied LSI and topic modeling almost a year ago, so what I say should be taken as merely a pointer to give you a general idea of where to look.
There are many different ways to do this with varying degrees of success. This is a hard problem in the realm of information retrieval. You can search for topic modeling to learn about different options and state of the art.
You definitely need some preprocessing and normalization if the words could appear in different forms. How about NLTK and one of its stemmers:
>>> from nltk.stem.lancaster import LancasterStemmer
>>> st = LancasterStemmer()
>>> st.stem('applied')
'apply'
>>> st.stem('applies')
'apply'
You have a lexicon of entries, which I am going to call terms, and also a collection of documents. I am going to explore a very basic technique to rank documents with regard to the terms. There are a gazillion more sophisticated ways you can read about, but I think this might be enough if you are not looking for something too sophisticated and rigorous.
This is called a vector space IR model. Terms and documents are both converted to vectors in a k-dimensional space. For that we have to construct a term-by-document matrix. This is a sample matrix in which the numbers represent frequencies of the terms in documents:
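For illustration, suppose there are three terms t1, t2, t3 and four documents d1..d4 (the counts below are made up):

        d1   d2   d3   d4
t1       2    0    1    0
t2       0    3    0    1
t3       1    1    2    0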
So far we have a 3x4 term-by-document matrix, in which each document is expressed as a 3-dimensional vector (one column per document). But as the number of terms increases, these vectors become too large and increasingly sparse. Also, there are many words such as I or and that occur in most documents without adding much semantic content, so you might want to disregard these types of words. For the problem of largeness and sparseness, you can use a mathematical technique called SVD that scales down the matrix while preserving most of the information it contains.
Also, the numbers we used in the matrix above were raw counts. Another option would be to use Boolean values: 1 for the presence and 0 for the absence of a term in a document. But both assume that all words carry equal semantic weight, whereas in reality rarer words carry more weight than common ones. So a good way to adjust the initial matrix is to use a weighting scheme like tf-idf to assign relative weights to each term. Once we have applied SVD to our weighted term-by-document matrix, we can construct the k-dimensional query vectors, which are simply arrays of the term weights. If a query contains multiple instances of the same term, the product of the frequency and the term weight is used.
What we need to do from there is somewhat straightforward. We compare the query vectors with document vectors by analyzing their cosine similarities and that would be the basis for the ranking of the documents relative to the queries.
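Here is a hedged end-to-end sketch of that pipeline, using scikit-learn and the NLTK stemmer from above (the documents and lexicon below are placeholders, and this is only one way to wire it up):

# Sketch: stem everything, build a TF-IDF document matrix, build a query
# vector from the lexicon, rank documents by cosine similarity.
from nltk.stem.lancaster import LancasterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

st = LancasterStemmer()

def normalize(text):
    # Lowercase, strip simple punctuation, stem every word.
    return " ".join(st.stem(w.strip(".,;:!?")) for w in text.lower().split())

documents = ["The new device was applied twice.", "An unrelated text about cooking."]
lexicon = ["apply device", "measurement instrument"]     # 1..N-word terms

# ngram_range covers multi-word lexicon entries (up to 3-word terms here).
vec = TfidfVectorizer(ngram_range=(1, 3))
doc_matrix = vec.fit_transform([normalize(d) for d in documents])
query_vec = vec.transform([" ".join(normalize(t) for t in lexicon)])

scores = cosine_similarity(query_vec, doc_matrix)[0]
print(scores)        # higher score = document more likely about the lexicon's terms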

simhash like algorithm to compare two text documents

The problem is:
I have a collection of text documents, and I want to pick the one most similar to an input document.
The input document could be an exact match or a partly modified version.
The algorithm must be very fast.
Currently, I have found SimHash for taking a fingerprint of the collection's documents. Is there any other algorithm that does the same thing?
LSH (Locality Sensitive Hashing) techniques are general indexing methods. They are very efficient at finding approximate nearest neighbors.
SimHash is one hashing algorithm for LSH. It uses cosine similarity over real-valued data.
MinHash is another hashing algorithm for LSH. It calculates resemblance similarity over binary vectors.
Mining of Massive Datasets, Chapter 3, by Anand Rajaraman and Jeff Ullman, is a good introduction to the problem space and MinHash in particular.
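To give a rough feel for how SimHash fingerprints work (a simplified sketch, not a production implementation): sum per-token hash bits into a signed vector and keep its sign pattern; near-duplicate documents end up with fingerprints that differ in only a few bits.

# Simplified 64-bit SimHash sketch.
import hashlib

def simhash(text, bits=64):
    v = [0] * bits
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    # The fingerprint is the sign pattern of the accumulated bit positions.
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    return bin(a ^ b).count("1")       # small distance = near-duplicate

fp1 = simhash("the quick brown fox jumps over the lazy dog")
fp2 = simhash("the quick brown fox jumped over the lazy dog")
print(hamming(fp1, fp2))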
Have you tried LSH (Locality Sensitive Hashing) techniques?

Finding related texts(correlation between two texts)

I'm trying to find similar articles in a database via correlation.
So I split the text into an array of words, delete frequently used words (articles, pronouns and so on), then compare two texts with a Pearson correlation function. For some texts it works, but for others it's not so good (longer texts get higher coefficients).
Can somebody advise a good method for finding related texts?
Some of the problems you mention boil down to normalizing over document length and overall word frequency. Try tf-idf.
First and foremost, you need to specify what you precisely mean by similarity and when two documents are (more/less) similar.
If the similarity you are looking for is literal, then I would vectorise the documents using term frequencies, and use the cosine similarity to liken them to each other given that texts are inherently directional data. tf-idf and log-entropy weighting schemes may be tested depending on your use-case. The edit distance is inefficient with long texts.
If you care more about the semantics, word embeddings are your ally.
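A minimal sketch of the term-frequency / cosine-similarity suggestion above (assuming scikit-learn; the articles below are placeholders):

# Sketch: TF-IDF vectors + cosine similarity, which normalizes away document
# length (the problem the Pearson approach runs into).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

articles = ["first article text about politics",
            "second article text about sports",
            "a third article about election politics"]

X = TfidfVectorizer(stop_words="english").fit_transform(articles)
sim = cosine_similarity(X)          # sim[i, j] in [0, 1]; higher = more similar
print(sim)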
