Finding related texts(correlation between two texts) - text

I'm trying to find similar articles in database via correlation.
So i split text in array of words, then delete frequently used words (articles,pronouns and so on), then compare two text with pearson coefficient function. For some text it's works but for other it's not so good(texts with large text have higher coefficient).
Can somebody advice a good method to find related texts?

Some of the problems you mention boild down to normalizing over document length and overall word frequency. Try tf-idf.

First and foremost, you need to specify what you precisely mean by similarity and when two documents are (more/less) similar.
If the similarity you are looking for is literal, then I would vectorise the documents using term frequencies, and use the cosine similarity to liken them to each other given that texts are inherently directional data. tf-idf and log-entropy weighting schemes may be tested depending on your use-case. The edit distance is inefficient with long texts.
If you care more about the semantics, word embeddings are your ally.

Related

Jaccard vs. Cosine similarity for measuring distance between two words (fasttext)

These two distance measurements seem to be the most common in NLP from what I've read. I'm currently using cosine similarity (as does the gensim.fasttext distance measurement). Is there any case to be made for the use of Jaccard instead? Does it even work with only single words as input (with the use of ngrams I suppose)?
ft = fasttext.load_model('cc.en.300.bin')
distance = scipy.spatial.distance.cosine(ft['word1'], ft['word2'])
I suppose I could imagine Jaccard similarity over bags-of-ngrams being useful for something. You could try some experiments to see if it correlates with good performance on some particular word-to-word task.
Maybe: typo correction? Or perhaps, when using a plain, non-Fasttext set-of-word-vectors, you might try synthesizing vectors for OOV words, by some weighted average of the most ngram-Jaccard-similar existing words? (In both cases: other simple comparisons, like edit-distance or shared-substring counting, might do better.)
But, I've not noticed projects using Jaccard-over-ngrams in lieu of whole-word-vector to whole-word-vector comparisons, nor libraries offering it as part of their interfaces/examples.
You've also only described its potential use very vaguely, "with the use of ngrams I suppose", with no code either demonstrating such calculation, or the results of such calculation being put to any use.
So potential usefulness seems like a research conjecture that you'd need to probe with your own experiments.

Does adding a list of Word2Vec embeddings give a meaningful represenation?

I'm using a pre-trained word2vec model (word2vec-google-news-300) to get the embeddings for a given list of words. Please note that this is NOT a list of words that we get after tokenizing a sentence, it is just a list of words that describe a given image.
Now I'd like to get a single vector representation for the entire list. Does adding all the individual word embeddings make sense? Or should I consider averaging?
Also, I would like the vector to be of a constant size so concatenating the embeddings is not an option.
It would be really helpful if someone can explain the intuition behind considering either one of the above approaches.
Averaging is most typical, when someone is looking for a super-simple way to turn a bag-of-words into a single fixed-length vector.
You could try a simple sum, as well.
But note that the key difference between the sum and average is that the average divides by the number of input vectors. Thus they both result in a vector that's pointing in the exact same 'direction', just of different magnitude. And, the most-often-used way of comparing such vectors, cosine-similarity, is oblivious to magnitudes. So for a lot of cosine-similarity-based ways of later comparing the vectors, sum-vs-average will give identical results.
On the other hand, if you're comparing the vectors in other ways, like via euclidean-distances, or feeding them into other classifiers, sum-vs-average could make a difference.
Similarly, some might try unit-length-normalizing all vectors before use in any comparisons. After such a pre-use normalization, then:
euclidean-distance (smallest to largest) & cosine-similarity (largest-to-smallest) will generate identical lists of nearest-neighbors
average-vs-sum will result in different ending directions - as the unit-normalization will have upped some vectors' magnitudes, and lowered others, changing their relative contributions to the average.
What should you do? There's no universally right answer - depending on your dataset & goals, & the ways your downstream steps use the vectors, different choices might offer slight advantages in whatever final quality/desirability evaluation you perform. So it's common to try a few different permutations, along with varying other parameters.
Separately:
The GoogleNews vectors were trained on news articles back around 2013; their word senses thus may not be optimal for an image-labeling task. If you have enough of your own data, or can collect it, training your own word-vectors might result in better results. (Both the use of domain-specific data, & the ability to tune training parameters based on your own evaluations, could offer benefits - especially when your domain is unique, or the tokens aren't typical natural-language sentences.)
There are other ways to create a single summary vector for a run-of-tokens, not just arithmatical-combo-of-word-vectors. One that's a small variation on the word2vec algorithm often goes by the name Doc2Vec (or 'Paragraph Vector') - it may also be worth exploring.
There are also ways to compare bags-of-tokens, leveraging word-vectors, that don't collapse the bag-of-tokens to a single fixed-length vector 1st - and while they're more expensive to calculate, sometimes offer better pairwise similarity/distance results than simple cosine-similarity. One such alternate comparison is called "Word Mover's Distance" - at some point,, you may want to try that as well.

Why word embedding technique works

I have look into some word embedding techniques, such as
CBOW: from context to single word. Weight matrix produced used as embedding vector
Skip gram: from word to context (from what I see, its acutally word to word, assingle prediction is enough). Again Weight matrix produced used as embedding
Introduction to these tools would always quote "cosine similarity", which says words of similar meanning would convert to similar vector.
But these methods all based on the 'context', account only for words around a target word. I should say they are 'syntagmatic' rather than 'paradigmatic'. So why the close in distance in a sentence indicate close in meaning? I can think of many counter example that frequently occurs
"Have a good day". (good and day are vastly different, though close in distance).
"toilet" "washroom" (two words of similar meaning, but a sentence contains one would unlikely to contain another)
Any possible explanation?
This sort of "why" isn't a great fit for StackOverflow, but some thoughts:
The essence of word2vec & similar embedding models may be compression: the model is forced to predict neighbors using far less internal state than would be required to remember the entire training set. So it has to force similar words together, in similar areas of the parameter space, and force groups of words into various useful relative-relationships.
So, in your second example of 'toilet' and 'washroom', even though they rarely appear together, they do tend to appear around the same neighboring words. (They're synonyms in many usages.) The model tries to predict them both, to similar levels, when typical words surround them. And vice-versa: when they appear, the model should generally predict the same sorts of words nearby.
To achieve that, their vectors must be nudged quite close by the iterative training. The only way to get 'toilet' and 'washroom' to predict the same neighbors, through the shallow feed-forward network, is to corral their word-vectors to nearby places. (And further, to the extent they have slightly different shades of meaning – with 'toilet' more the device & 'washroom' more the room – they'll still skew slightly apart from each other towards neighbors that are more 'objects' vs 'places'.)
Similarly, words that are formally antonyms, but easily stand-in for each-other in similar contexts, like 'hot' and 'cold', will be somewhat close to each other at the end of training. (And, their various nearer-synonyms will be clustered around them, as they tend to be used to describe similar nearby paradigmatically-warmer or -colder words.)
On the other hand, your example "have a good day" probably doesn't have a giant influence on either 'good' or 'day'. Both words' more unique (and thus predictively-useful) senses are more associated with other words. The word 'good' alone can appear everywhere, so has weak relationships everywhere, but still a strong relationship to other synonyms/antonyms on an evaluative ("good or bad", "likable or unlikable", "preferred or disliked", etc) scale.
All those random/non-predictive instances tend to cancel-out as noise; the relationships that have some ability to predict nearby words, even slightly, eventually find some relative/nearby arrangement in the high-dimensional space, so as to help the model for some training examples.
Note that a word2vec model isn't necessarily an effective way to predict nearby words. It might never be good at that task. But the attempt to become good at neighboring-word prediction, with fewer free parameters than would allow a perfect-lookup against training data, forces the model to reflect underlying semantic or syntactic patterns in the data.
(Note also that some research shows that a larger window influences word-vectors to reflect more topical/domain similarity – "these words are used about the same things, in the broad discourse about X" – while a tiny window makes the word-vectors reflect a more syntactic/typical similarity - "these words are drop-in replacements for each other, fitting the same role in a sentence". See for example Levy/Goldberg "Dependency-Based Word Embeddings", around its Table 1.)
‘Embedding’ mean a semantic vector representation. e.g. how to represent words such that synonyms are nearer than antonyms or other unrelated words.
Embeddings algorithms like Word2vec maps entities be it e-commerce
items or words (say in English language), to N-dimensional vectors.
Now since you have a mathematical representation of the entities in
a Euclidean space, you can use associated semantics such as distance
between vectors. e.g:
For a given item say ‘Levis Jeans’ recommend the most related items
which are often co-purchased with it.
This can be easily done: search the nearest vectors to the vector of
‘Levis Jeans’, and recommend them. You will find that the nearest
vectors correspond to items such as T-shirts etc., which are
relevant to the Levis Jeans. Similarly it preserves
distance/similarity between words e.g.: King - Queen = Man - Woman !
Yes, Word2vec captures such co-occurrance relationships, when
mapping the items/words to vectors also called as ‘item/word
embeddings’.
This is not specifically targeted to sentence embeddings but nevertheless here you get some crucial insights extremely relevant to the core logic behind embedding generation. Read till the end.

Applied NLP: how to score a document against a lexicon of multi-word terms?

This is probably a fairly basic NLP question but I have the following task at hand: I have a collection of text documents that I need to score against an (English) lexicon of terms that could be 1-, 2-, 3- etc N-word long. N is bounded by some "reasonable" number but the distribution of various terms in the dictionary for various values of n = 1, ..., N might be fairly uniform. This lexicon can, for example, contain a list of devices of certain type and I want to see if a given document is likely about any of these devices. So I would want to score a document high(er) if it has one or more occurrences of any of the lexicon entries.
What is a standard NLP technique to do the scoring while accounting for various forms of the words that may appear in the lexicon? What sort of preprocessing would be required for both the input documents and the lexicon to be able to perform the scoring? What sort of open-source tools exist for both the preprocessing and the scoring?
I studied LSI and topic modeling almost a year ago, so what I say should be taken as merely a pointer to give you a general idea of where to look.
There are many different ways to do this with varying degrees of success. This is a hard problem in the realm of information retrieval. You can search for topic modeling to learn about different options and state of the art.
You definitely need some preprocessing and normalization if the words could appear in different forms. How about NLTK and one of its stemmers:
>>> from nltk.stem.lancaster import LancasterStemmer
>>> st = LancasterStemmer()
>>> st.stem('applied')
'apply'
>>> st.stem('applies')
'apply'
You have a lexicon of terms that I am going to call terms and also a bunch of documents. I am going to explore a very basic technique to rank documents with regards to the terms. There are a gazillion more sophisticated ways you can read about, but I think this might be enough if you are not looking for something too sophisticated and rigorous.
This is called a vector space IR model. Terms and documents are both converted to vectors in a k-dimensional space. For that we have to construct a term-by-document matrix. This is a sample matrix in which the numbers represent frequencies of the terms in documents:
So far we have a 3x4 matrix using which each document can be expressed by a 3-dimensional array (each column). But as the number of terms increase, these arrays become too large and increasingly sparse. Also, there are many words such as I or and that occur in most of the documents without adding much semantic content. So you might want to disregard these types of words. For the problem of largeness and sparseness, you can use a mathematical technique called SVD that scales down the matrix while preserving most of the information it contains.
Also, the numbers we used on the above chart were raw counts. Another technique would be to use Boolean values: 1 for presence and 0 zero for lack of a term in a document. But these assume that words have equal semantic weights. In reality, rarer words have more weight than common ones. So, a good way to edit the initial matrix would be to use ranking functions like tf-id to assign relative weights to each term. If by now we have applied SVD to our weighted term-by-document matrix, we can construct the k-dimensional query vectors, which are simply an array of the term weights. If our query contained multiple instances of the same term, the product of the frequency and the term weight would have been used.
What we need to do from there is somewhat straightforward. We compare the query vectors with document vectors by analyzing their cosine similarities and that would be the basis for the ranking of the documents relative to the queries.

How to represent text documents as feature vectors for text classification?

I have around 10,000 text documents.
How to represent them as feature vectors, so that I can use them for text classification?
Is there any tool which does the feature vector representation automatically?
The easiest approach is to go with the bag of words model. You represent each document as an unordered collection of words.
You probably want to strip out punctuation and you may want to ignore case. You might also want to remove common words like 'and', 'or' and 'the'.
To adapt this into a feature vector you could choose (say) 10,000 representative words from your sample, and have a binary vector v[i,j] = 1 if document i contains word j and v[i,j] = 0 otherwise.
To give a really good answer to the question, it would be helpful to know, what kind of classification you are interested in: based on genre, author, sentiment etc. For stylistic classification for example, the function words are important, for a classification based on content they are just noise and are usually filtered out using a stop word list.
If you are interested in a classification based on content, you may want to use a weighting scheme like term frequency / inverse document frequency,(1) in order to give words which are typical for a document and comparetively rare in the whole text collection more weight. This assumes a vector space model of your texts which is a bag of word representation of the text. (See Wikipedia on Vector Space Modell and tf/idf) Usually tf/idf will yield better results than a binary classification schema which only contains the information whether a term exists in a document.
This approach is so established and common that machine learning libraries like Python's scikit-learn offer convenience methods which convert the text collection into a matrix using tf/idf as a weighting scheme.

Resources