Why are TF-IDF vocabulary words represented as axes/dimensions? - nlp

I want an intuitive way of understanding why each word in a TF-IDF vocabulary is represented as a separate dimension.
Why can't I just add the TF-IDF values of all the words together and use that as a representation of the document?
I have a basic understanding of why we do this. Apples =/= Oranges.
But apparently I don't know it well enough to convince someone else!

Ultimately all of NLP is arbitrary. If you wanted to add up the tf-idf values for all words in a phrase/sentence/document and found the resulting number useful for some task you were trying to do, you are free to do so. But that number probably won't be very useful for most standard NLP tasks such as search, summarization, sentiment analysis, etc. It's hard to represent the meaning of a phrase/sentence/document with a single number.
By representing a phrase/sentence/document as a vector which has a separate row for each word in your vocabulary, you can leverage vector/matrix algebra to represent some standard operations you might want to do when solving NLP problems. For example, you could compute the cosine similarity between the vectors representing 2 documents and use that to judge how similar those 2 documents are.
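For instance, here is a minimal sketch of that idea with scikit-learn (the two documents are made up for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the cat sat on the mat",     # made-up document 1
        "the dog sat on the log"]     # made-up document 2

# Each document becomes a vector with one dimension per vocabulary word
tfidf = TfidfVectorizer().fit_transform(docs)

# Cosine similarity between the two document vectors
print(cosine_similarity(tfidf[0], tfidf[1]))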
Something else you might be interested in: There is an NLP concept called word2vec which lets you represent every word as a different vector of numbers and then lets you add/subtract them to discover semantic relations between them.
For example, it might say
king - man + woman ≈ queen
You can read more about this at https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/
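As a rough sketch of that analogy in code with gensim (the file name below is just a placeholder for whatever pretrained vectors you have downloaded):

from gensim.models import KeyedVectors

# Placeholder path to pretrained word2vec-format embeddings
wv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

# king - man + woman ~= queen
print(wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))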

Related

Calculation of Cosine Similarity of a single word in 2 different Word2Vec Models

I built two word embeddings (word2vec models) using gensim and saved them (as word2vec1 and word2vec2) with the model.save(model_name) command, for two different corpora (the two corpora are somewhat similar; similar means they are related, like part 1 and part 2 of a book). Suppose the top word (in terms of frequency or occurrence) for both corpora is the same word (let's call it 'a').
How do I compute the degree of similarity (cosine similarity) of the extracted top word (say 'a') across the two word2vec models? Will most_similar() work efficiently in this case?
I want to know to what degree of similarity the same word (a) is related across the two different generated models.
Any idea is deeply appreciated.
You seem to have the wrong idea about word2vec. It doesn't provide one absolute vector for one word. It manages to find a representation for a word relative to other words. So, for the same corpus, if you run word2vec twice, you will get 2 different vectors for the same word. The meaning comes in when you compare it relative to other word vectors.
king - man will always be close (in terms of cosine similarity) to queen - woman no matter how many times you train it. But they will have different vectors after each training run.
In your case, since the 2 models are trained differently, comparing vectors of the same word is the same as comparing two random vectors. You should rather compare the relative relations. Maybe something like: model1.most_similar('dog') vs model2.most_similar('dog')
However, to answer your question, if you wanted to compare the 2 vectors, you could do it as below. But the results will be meaningless.
Just take the vectors from each model and manually calculate cosine similarity.
import numpy as np

# model1 and model2 are the two gensim word2vec models loaded from disk
vec1 = model1.wv['computer']
vec2 = model2.wv['computer']
print(np.sum(vec1 * vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2)))

What's a good measure for classifying text documents?

I have written an application that measures text importance. It takes a text article, splits it into words, drops stopwords, performs stemming, and counts word-frequency and document-frequency. Word-frequency is a measure that counts how many times the given word appeared in all documents, and document-frequency is a measure that counts how many documents the given word appeared in.
Here's an example with two text articles:
Article I) "A fox jumps over another fox."
Article II) "A hunter saw a fox."
Article I gets split into words (after stemming and dropping stopwords):
["fox", "jump", "another", "fox"].
Article II gets split into words:
["hunter", "see", "fox"].
These two articles produce the following word-frequency and document-frequency counters:
fox (word-frequency: 3, document-frequency: 2)
jump (word-frequency: 1, document-frequency: 1)
another (word-frequency: 1, document-frequency: 1)
hunter (word-frequency: 1, document-frequency: 1)
see (word-frequency: 1, document-frequency: 1)
Given a new text article, how do I measure how similar this article is to previous articles?
I've read about df-idf measure but it doesn't apply here as I'm dropping stopwords, so words like "a" and "the" don't appear in the counters.
For example, I have a new text article that says "hunters love foxes", how do I come up with a measure that says this article is pretty similar to ones previously seen?
Another example, I have a new text article that says "deer are funny", then this one is a totally new article and similarity should be 0.
I imagine I somehow need to sum word-frequency and document-frequency counter values but what's a good formula to use?
A standard solution is to apply the Naive Bayes classifier which estimates the posterior probability of a class C given a document D, denoted as P(C=k|D) (for a binary classification problem, k=0 and 1).
This is estimated by computing the priors from a training set of class labeled documents, where given a document D we know its class C.
P(C|D) ∝ P(D|C) * P(C) (1)
Naive Bayes assumes that terms are independent, in which case you can write P(D|C) as
P(D|C) = \prod_{t \in D} P(t|C) (2)
P(t|C) can simply be computed by counting how many times a term occurs in a given class, e.g. you expect that the word football will occur a large number of times in documents belonging to the class (category) sports.
When it comes to the other factor, the class prior P(C), you can estimate it by counting how many labeled documents you have from each class; maybe you have more sports articles than finance ones, which makes you believe that an unseen document is more likely to be classified into the sports category.
It is very easy to incorporate factors such as term importance (idf) or term dependence into Equation (1). For idf, you add it as a term sampling event from the collection (irrespective of the class).
For term dependence, you have to plug in probabilities of the form P(u|C)*P(u|t), which means that you sample a different term u and change (transform) it to t.
Standard implementations of the Naive Bayes classifier can be found in the Stanford NLP package, Weka and scikit-learn, among many others.
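For illustration, here is a minimal sketch of that counting approach using scikit-learn's MultinomialNB (the tiny training set and labels are made up):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up labeled training documents
train_docs = ["the team won the football match",
              "the bank raised interest rates",
              "the striker scored two goals",
              "markets fell as inflation rose"]
train_labels = ["sports", "finance", "sports", "finance"]

# CountVectorizer supplies the term counts behind P(t|C);
# MultinomialNB combines them with the class priors P(C)
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(train_docs, train_labels)

print(clf.predict(["the keeper saved the penalty"]))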
It seems that you are trying to answer several related questions:
How to measure similarity between documents A and B? (Metric learning)
How to measure how unusual document C is, compared to some collection of documents? (Anomaly detection)
How to split a collection of documents into groups of similar ones? (Clustering)
How to predict to which class a document belongs? (Classification)
All of these problems are normally solved in 2 steps:
Extract the features: Document --> Representation (usually a numeric vector)
Apply the model: Representation --> Result (usually a single number)
There are lots of options for both feature engineering and modeling. Here are just a few.
Feature extraction
Bag of words: Document --> number of occurrences of each individual word (that is, term frequencies). This is the basic option, but not the only one.
Bag of n-grams (at word level or character level): the co-occurrence of several tokens is taken into account.
Bag of words + grammatical features (e.g. POS tags)
Bag of word embeddings (learned by an external model, e.g. word2vec). You can use the embeddings as a sequence or take their weighted average.
Whatever you can invent (e.g. rules based on dictionary lookup)...
Features may be preprocessed in order to decrease the relative amount of noise in them. Some options for preprocessing are:
dividing by IDF, if you don't have a hard list of stop words or believe that words might be more or less "stoppy"
normalizing each column (e.g. word count) to have zero mean and unit variance
taking logs of word counts to reduce noise
normalizing each row to have L2 norm equal to 1
You cannot know in advance which option(s) is(are) best for your specific application - you have to do experiments.
Now you can build the ML model. Each of the 4 problems has its own good solutions.
For classification, the best studied problem, you can use multiple kinds of models, including Naive Bayes, k-nearest-neighbors, logistic regression, SVM, decision trees and neural networks. Again, you cannot know in advance which would perform best.
Most of these models can use almost any kind of features. However, KNN and kernel-based SVM require your features to have a special structure: representations of documents of one class should be close to each other in the sense of the Euclidean distance metric. This can sometimes be achieved by simple linear and/or logarithmic normalization (see above). More difficult cases require non-linear transformations, which in principle can be learned by neural networks. Learning these transformations is what people call metric learning, and in general it is a problem which is not yet solved.
The most conventional distance metric is indeed Euclidean. However, other distance metrics are possible (e.g. Manhattan distance), as are different approaches not based on vector representations of texts. For example, you can calculate the Levenshtein distance between texts, based on the number of operations needed to transform one text into the other. Or you can calculate the "word mover's distance" - the sum of distances between word pairs with the closest embeddings.
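For example, gensim exposes word mover's distance on top of pretrained word vectors; a rough sketch (the path is a placeholder, and the computation may need the pyemd package installed):

from gensim.models import KeyedVectors

# Placeholder path to any word2vec-format embeddings
wv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

sentence1 = ['obama', 'speaks', 'to', 'the', 'media', 'in', 'illinois']
sentence2 = ['the', 'president', 'greets', 'the', 'press', 'in', 'chicago']

# Word mover's distance: cost of moving one sentence's word vectors onto the other's
print(wv.wmdistance(sentence1, sentence2))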
For clustering, the basic options are K-means and DBSCAN. Both of these models require your feature space to have this Euclidean property.
For anomaly detection you can use density estimates, which are produced by various probabilistic algorithms: classification (e.g. naive Bayes or neural networks), clustering (e.g. Gaussian mixture models), or other unsupervised methods (e.g. probabilistic PCA). For texts, you can exploit the sequential language structure, estimating the probability of each word conditioned on the previous words (using n-grams or convolutional/recurrent neural nets) - these are called language models, and they are usually more effective than the bag-of-words assumption of Naive Bayes, which ignores word order. Several language models (one for each class) may be combined into one classifier.
Whatever problem you solve, it is strongly recommended to have a good test set with the known "ground truth": which documents are close to each other, or belong to the same class, or are (un)usual. With this set, you can evaluate different approaches to feature engineering and modelling, and choose the best one.
If you don't have the resources or willingness to do multiple experiments, I would recommend choosing one of the following approaches to evaluate similarity between texts:
word counts + idf normalization + L2 normalization (equivalent to the solution of #mcoav) + Euclidean distance (see the sketch below)
mean word2vec embedding over all words in text (the embedding dictionary may be googled up and downloaded) + Euclidean distance
Based on one of these representations, you can build models for the other problems - e.g. KNN for classifications or k-means for clustering.
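As a minimal sketch of the first recommendation with scikit-learn (the example documents are made up; TfidfVectorizer already applies idf weighting and L2-normalizes each row by default):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import euclidean_distances

docs = ["a fox jumps over another fox",
        "a hunter saw a fox",
        "deer are funny"]

# counts * idf, with L2-normalized rows
vectors = TfidfVectorizer().fit_transform(docs)

# Pairwise Euclidean distances between the document vectors
print(euclidean_distances(vectors))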
I would suggest tf-idf and cosine similarity.
You can still use tf-idf if you drop stop-words. It is even likely that including stop-words or not won't make much of a difference: the inverse document frequency measure automatically downweights stop-words, since they are very frequent and appear in most documents.
If your new document is entirely made of unknown terms, the cosine similarity will be 0 with every known document.
When I search on df-idf I find nothing.
tf-idf with cosine similarity is a widely accepted and common practice.
Filtering out stop words does not break it; idf gives common words low weight anyway.
tf-idf is used by Lucene.
I don't get why you want to reinvent the wheel here.
I don't get why you think the sum of df-idf values is a similarity measure.
For classification, do you have some predefined classes and sample documents to learn from? If so, you can use Naive Bayes. With tf-idf.
If you don't have predefined classes, you can use k-means clustering. With tf-idf.
It depends a lot on your knowledge of the corpus and your classification objective. In something like litigation support, documents are produced to you that you have no knowledge of. At Enron they used the names of raptors for a lot of the bad stuff, and there is no way you would have known that up front. k-means lets the documents find their own clusters.
Stemming does not always yield better classification. If you later want to highlight the hits, it makes that very complex, since the stem will not be the same length as the original word.
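If you go the unsupervised route mentioned above, a minimal k-means-on-tf-idf sketch with scikit-learn could look like this (the documents and the number of clusters are made up):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["a fox jumps over another fox",
        "a hunter saw a fox",
        "stocks fell sharply today",
        "markets rallied after the report"]

# tf-idf vectors, then let k-means find the clusters
vectors = TfidfVectorizer().fit_transform(docs)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)

print(kmeans.labels_)  # cluster id assigned to each document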
Have you evaluated sent2vec or doc2vec approaches? You can play around with the vectors to see how close the sentences are. Just an idea. Not a verified solution to your question.
While in English a word alone may be enough, it isn't the case in some other more complex languages.
A word has many meanings and many different use cases. Two texts can talk about the same things while using few or no matching words.
You need to find the most important words in a text. Then you need to catch their possible synonyms.
For that, the following API can help. It is also doable to create something similar with some dictionaries.
synonyms("complex")
function synonyms(me){
var url = 'https://api.datamuse.com/words?ml=' + me;
fetch(url).then(v => v.json()).then((function(v){
syn = JSON.stringify(v)
syn = JSON.parse(syn)
for(var k in syn){
document.body.innerHTML += "<span>"+syn[k].word+"</span> "
}
})
)
}
From there, comparing the arrays will give much more accuracy and far fewer false positives.
A sufficient solution, in a possibly similar task:
Use a binary bag-of-words (BoW) approach for the vector representation (frequent words are not weighted more heavily than rare ones), rather than a real TF approach; a short sketch follows after this list.
The word2vec embedding approach is sensitive to sequence and distance effects. Depending on your hyper-parameters, it might distinguish between 'a hunter saw a fox' and 'a fox saw a jumping hunter', so you have to decide whether this adds noise to your task or, alternatively, to use it only as an averaged vector over all of your text.
Extract words with high within-sentence correlation (e.g., by using mean-normalized cosine similarities).
Second step: use this list of highly correlated words as a positive list, i.e. as the new vocabulary for a new binary vectorizer.
This isolates meaningful words for the second-step cosine comparisons; in my case, it worked even for rather small amounts of training text.
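As a rough sketch of the binary bag-of-words ingredient with scikit-learn (the texts are made up; whether this is "sufficient" depends on your data):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = ["a hunter saw a fox",
         "a fox saw a jumping hunter"]

# binary=True: a word counts once per document, however often it occurs
vectors = CountVectorizer(binary=True).fit_transform(texts)

print(cosine_similarity(vectors))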

Classify words with the same meaning

I have 50,000 subject lines from emails and I want to classify the words in them based on synonyms or words that can be used instead of others.
For example:
Top sales!
Best sales
I want them to be in the same group.
I built the following function with NLTK's WordNet, but it doesn't work well.
from nltk.corpus import wordnet

def synonyms(w, group, guide):
    try:
        # Check whether the two words are similar in WordNet
        w1 = wordnet.synset(w + '.' + guide + '.01')
        w2 = wordnet.synset(group + '.' + guide + '.01')
        if w1.wup_similarity(w2) >= 0.7:
            return True
        elif w1.wup_similarity(w2) < 0.7:
            return False
    except:
        return False
Any ideas or tools to accomplish this?
The easiest way to accomplish this would be to compare the similarity of the respective word embeddings (the most common implementation of this is Word2Vec).
Word2Vec is a way of representing the semantic meaning of a token in a vector space, which enables the meanings of words to be compared without requiring a large dictionary/thesaurus like WordNet.
One problem with regular implementations of Word2Vec is that it does not differentiate between different senses of the same word. For example, the word bank would have the same Word2Vec representation in all of these sentences:
The river bank was dry.
The bank loaned money to me.
The plane may bank to the left.
Bank has the same vector in each of these cases, but you may want them to be sorted into different groups.
One way to solve this is to use a Sense2Vec implementation. Sense2Vec models take into account the context and part of speech (and potentially other features) of the token, allowing you to differentiate between the meanings of different senses of the word.
A great library for this in Python is spaCy. It is like NLTK, but much faster as it is written in Cython (20x faster for tokenization and 400x faster for tagging). It also has Sense2Vec embeddings built in, so you can accomplish your similarity task without needing other libraries.
It's as simple as:
import spacy

# Older spaCy versions; newer ones load a model such as 'en_core_web_md' instead of 'en'
nlp = spacy.load('en')
apples, and_, oranges = nlp(u'apples and oranges')
print(apples.similarity(oranges))
It's free and has a liberal license!
An idea is to solve this with embeddings and word2vec. The outcome will be a mapping from words to vectors which are "near" when the words have similar meanings; for example, "car" and "vehicle" will be near, while "car" and "food" will not. You can then measure the vector distance between two words and define a threshold to decide whether they are near enough to mean the same thing. As I said, it's just an idea based on word2vec.
The computation behind what Nick said is to calculate the distance (cosine distance) between two phrase vectors.
Top sales!
Best sales
Here is one way to do so: How to calculate phrase similarity between phrases
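A hedged sketch of that computation, averaging word vectors per phrase and taking the cosine similarity (it assumes some pretrained gensim KeyedVectors are already loaded as wv):

import numpy as np

def phrase_vector(phrase, wv):
    # Average the vectors of the in-vocabulary words in the phrase
    words = [w for w in phrase.lower().split() if w in wv]
    return np.mean([wv[w] for w in words], axis=0)

def phrase_similarity(p1, p2, wv):
    v1, v2 = phrase_vector(p1, wv), phrase_vector(p2, wv)
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

# e.g. phrase_similarity("Top sales", "Best sales", wv)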

How to find the most meaningful words in a text using word2vec?

So, for instance, I'm typing, as an input, some sentence with some semantic meaning and, as an output, I get some list of closest (in cosine distance) words (mostly single words).
But I want to understand which cluster my sentence belongs to, compute how far each word is located from it, and eliminate non-meaningful words from the sentence.
For example:
"I want to buy a pizza";
"pizza": 0.99123
"buy": 0.7834
"want": 0.1443
How can such a requirement be achieved out of the box, without any C coding?
Maybe I need to compute a cosine distance equation for this?
Thank you!
It seems like you need topic modeling instead of word2vec. Word2vec is used to capture local information; it is not a good idea to use it directly to classify or cluster words or sentences.
Another aspect is stop-word removal, since you mention non-meaningful words. By the way, they are not non-meaningful; they are just not aligned with any topic, which is why you think of them as non-meaningful.
I believe you should use the LDA topic modeling approach, and you don't need to implement anything since there are many implementations of LDA out there.
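For example, a minimal LDA sketch with gensim (the toy corpus and the number of topics are made up):

from gensim import corpora, models

docs = [["want", "buy", "pizza"],
        ["order", "pizza", "delivery"],
        ["buy", "new", "laptop"],
        ["laptop", "keyboard", "broken"]]

# Build a dictionary and a bag-of-words corpus, then fit LDA
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)

# Topic distribution for a new sentence
print(lda[dictionary.doc2bow(["want", "buy", "pizza"])])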

Paraphrase recognition using sentence level similarity

I'm a new entrant to NLP (Natural Language Processing). As a start-up project, I'm developing a paraphrase recognizer (a system which can recognize two similar sentences). For that recognizer I'm going to apply various measures at three levels, namely: lexical, syntactic and semantic. At the lexical level, there are multiple similarity measures like cosine similarity, matching coefficient, Jaccard coefficient, et cetera. For these measures I'm using the SimMetrics package developed by the University of Sheffield, which contains a lot of similarity measures. But for the Levenshtein distance and Jaro-Winkler distance measures, the code works only at the character level, whereas I require code at the sentence level (i.e. considering a single word as a unit instead of working character-wise). Additionally, there is no code for computing the Manhattan distance in SimMetrics. Are there any suggestions for how I could develop the required code (or could someone provide me the code) at the sentence level for the above mentioned measures?
Thanks a lot in advance for your time and effort helping me.
I have been working in the area of NLP for a few years now, and I completely agree with those who have provided answers/comments. This really is a hard nut to crack! But, let me still provide a few pointers:
(1) Lexical similarity: Instead of trying to generalize Jaro-Winkler distance to the sentence level, it is probably much more fruitful to develop a character-level or word-level language model and compute the log-likelihood. Let me explain further: train your language model on a corpus. Then take a whole lot of candidate sentences that have been annotated as similar/dissimilar to the sentences in the corpus. Compute the log-likelihood for each of these test sentences, and establish a cut-off value to determine similarity (a toy sketch follows after point (3)).
(2) Syntactic similarity: So far, only stylometric similarities can manage to capture this. For this, you will need to use PCFG parse trees (or TAG parse trees. TAG = tree adjoining grammar, a generalization of CFGs).
(3) Semantic similarity: off the top of my head, I can only think of using resources such as Wordnet, and identifying the similarity between synsets. But this is not simple either. Your first problem will be to identify which words from the two (or more) sentences are "corresponding words", before you can proceed to check their semantics.
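To make point (1) concrete, here is a toy sketch of a word-level bigram language model scored with add-one-smoothed log-likelihood (the corpus and the idea of comparing scores against a cut-off are purely illustrative):

from collections import Counter
import math

def train_bigram_lm(sentences):
    # Count unigrams and bigrams over a toy training corpus
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] + s.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def log_likelihood(sentence, unigrams, bigrams):
    # Add-one smoothed bigram log-likelihood; lower scores mean "less like the corpus"
    vocab = len(unigrams)
    tokens = ["<s>"] + sentence.lower().split()
    return sum(math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab))
               for prev, cur in zip(tokens, tokens[1:]))

corpus = ["a fox jumps over another fox", "a hunter saw a fox"]
uni, bi = train_bigram_lm(corpus)
print(log_likelihood("a hunter saw a fox", uni, bi))
print(log_likelihood("deer are funny", uni, bi))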
As Chris suggests, this is a non-trivial project for a beginner. I would suggest you start off with something simpler (if relatively boring) such as chunking.
Have a look at the docs and books for the Python NLTK library - there are some samples that are close to what you are looking for. For example, containment: is it plausible that one statement contains another? Note the 'plausible' there; the state of the art isn't good enough for a simple yes/no or even a probability.

Resources