Natural language query preprocessing - nlp

I am trying to implement a natural language query preprocessing module which would, given a query formulated in natural language, extract the keywords from that query and submit it to an Information Retrieval (IR) system.
At first, I thought about using some training set to compute tf-idf values of terms and use these values for estimating the importance of single words. But on second thought, this does not make any sense in this scenario - I only have a training collection but I dont have access to index the IR data. Would it be reasonable to only use the idf value for such estimation? Or maybe another weighting approach?
Could you suggest how to tackle this problem? Usually, the articles about NLP processing that I read address training and test data sets. But what if I only have the query and training data?

tf-idf (it's not capitalized, fyi) is a good choice of weight. Your intuition is correct here. However, you don't compute tf-idf on your training set alone. Why? You need to really understand what the tf and idf mean:
tf (term frequency) is a statistic that indicates whether a term appears in the document being evaluated. The simplest way to calculate it would simply be a boolean value, i.e. 1 if the term is in the document.
idf (inverse document frequency), on the other hand, measures how likely a term appears in a random document. It's most often calculated as the log of (N/number of document matches).
Now, tf is calculated for each of the document your IR system will be indexing over (if you don't have the access to do this, then you have a much bigger and insurmountable problem, since an IR without a source of truth is an oxymoron). Ideally, idf is calculated over your entire data set (i.e. all the documents you are indexing), but if this is prohibitively expensive, then you can random sample your population to create a smaller data set, or use a training set such as the Brown corpus.

Related

Evaluation of gensim Doc2Vec model for Recommendations

I have developed a pipeline to extract text from documents, preprocess the text, and train a gensim Doc2vec model on given documents. Given a document in my corpus, I would like to recommend other documents in the corpus.
I want to know how I can evaluate my model without having a pre-defined list of "good" recommendations. Any ideas?
One simple self-check that can be used to catch some big problems with a Doc2Vec model training pipeline – like gross misparameterizations, or insufficient data/epochs – is to re-infer vectors for the training texts (using .infer_vector()), and check that generally:
the bulk-trained vector for the same text is "close to" the re-inferred vector - such as its nearest-neighbor, or one of the top neighbors, in a .most_similar() operation on the re-inferred text
the overall list of nearest-neighbors (from .most_similar()) for the bulk-trained vector, & the re-inferred vector, are very similar.
They won't necessarily be identical, for reasons explained in Q11 & Q12 of the Gensim Project FAQ, but if they're wildly-different, then something foundational has gone wrong, like:
insufficient (in quantity or quality/form) training data
misparameterizations, like too few epochs or too-large (overfitting-prone) vectors for the quantity of data
Ultimately, though, the variety of data sources & intended uses & possible dimensions of "recommendation-worthy" mean that you need cusomt inputs, based on your project's needs, usually from the intended audience (or your own ability to simulate/represent it).
In the original paper introducing the "Paragraph Vector" algorithm (what's inside the Doc2Vec class), and a followup evaluating it on Wikipedia & arXiv articles, several of the evaluations used triplets of documents, where 2 of the triplet were conjectured to be "necessarily similar" based on some preexisting system's groupings, and the 3rd randomly-chosen.
The algorithm's performance, and relative performance under different parameter choices, was scored based on how often it placed the 2 presumptively-related documents closer-together than the 3rd randomly-chosen document.
For example, one of the original paper's evaluations use brief search-engine-result snippets as documents, and considered any 2 documents that appeared as sibling top-10 results for the same query as presumptively-related. Two of the followup paper's evaluation used the human-curated categories of Wikipedia or arXiv as signalling that articles of the same category should be presumptively-related.
It's imperfect, but allowed the creation of large evaluation sets from already-existing systems/data, which generally pointed results in the same direction as human senses-of-relatedness.
Perhaps you can find a similar preexisting guide for your data. Or, as you perform ad-hoc checking, be sure to capture every judgement you make, so that it becomes, over time, a growing dataset of desirable pairings that are either (a) better than some other result that was co-presented; or (b) just "presumably good enough" that they usually should rank higher than other random 3rd documents. A large amount of imprecision in such desirability-data is tolerable, as it can even out as the set of probe-pairings grows, and the power of being able to automate bulk quantitative evaluations (reusing old assessments against new parameters/models) drives far more overall improvement than any small glitches in the evaluations cost.

how to choose the best vector_size for doc2vec?

I am comparing techniques and want to find out what is the best method to vector and reduce dimensions of a large number of text documents. I have already tested Bag of Words and TF-IDF and reduced dimensions with PCA, SVD, and NMF. Using these approaches I can reduce my data and know the best number of dimensions based on the variance explained.
However, I want to do the same with doc2vec, considering that doc2vec itself is a dimensional reducer, what is the best approach to find out the number of dimensions for my model? Is there any statistical measure that helps me find the best number of vector_size?
Thanks in advance!
There's no magic indicator for what's best; you should try a range of dimensionalities to see what scores well on your specific downstream evaluations, given your data & goals.
If using a doc2vec implementation that offers inference of out-of-training set documents (such as via the .infer_vector() method in Python gensim library), then a plausible sanity check for eliminating very-bad choices of vector_size (or other parameters) is to re-infer vectors for training-set documents.
If repeated re-inferences of the same text are are generally "close to" each other, and to the vector for that same document created by the full model training, that's a weak indicator that the model is at least behaving in a self-consistent way. (If the spread of results is large, that might indicate potential problems with insufficient data, too few training epochs, a too-large/overfit model, or other foundational issues.)

Text Classification + NLP + Data-mining + Data Science: Should I do stop word removal and stemming before applying tf-idf?

I am working on a text classification problem. The problem is explained below:
I have a dataset of events which contains three columns - name of the event, description of the event, category of the event. There are about 32 categories in the dataset, such as, travel, sport, education, business etc. I have to classify each event to a category depending on its name and description.
What I understood is this particular task of classification is highly dependent on keywords, rather than, semantics. I am giving you two examples:
If the word 'football' is found either in the name or description or in both, it is highly likely that the event is about sport.
If the word 'trekking' is found either in the name or description or in both, it is highly likely that the event is about travel.
We are not considering multiple categories for an event(however, that's a plan for future !! )
I hope applying tf-idf before Multinomial Naive Bayes would lead to decent result for this problem. My question is:
Should I do stop word removal and stemming before applying tf-idf or should I apply tf-idf just on raw text? Here text means entries in name of event and description columns.
The question is too generic and you are not providing samples of the dataset, code, and not even indicating the language you are using. To this regard, I will presume that you are using English, since the two words that you are providing as an example are "football" and "trekking". The answer will however necessarily be generic.
Should I do stop word removal
Yes. Have a look at this to see the most frequent words in the English language. As you can see they have no semantic meaning, and thus would not contribute to solving the classification task that you have proposed. if stopwords is a list containing stopwords, the parameter stop_words=stopwords passed to the CountVectorizer or TfidfVectorizer constructor will automatically exclude the stopwords when invoking the .fit_transform() method.
Should I do stemming
It depends. Languages other than English, whose grammar rules allow for a big number of possible prefixes-suffixes, normally require stemming when performing classification task, in order to reach any useful result. The English language however has very poor grammar rules, and thus you can often get away without stemming/lemmatization. You should check the results obtained against the desired accuracy first, and if it is insufficient, try adding a stemming/lemmatization step in the preprocessing of your data. Stemming is a computationally expensive process for large corpora, and I personally use it only for languages that require it.
I hope applying tf-idf before Multinomial Naive Bayes would lead to decent result for this problem
Careful with this. While tf-idf in practice works with Naive Bayesian classifiers, this is not the way that specific classifier is meant to be used. From the documentation,
The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work. It is in your best interest to tackle the classification task with CountVectorizer first and score it, and after you have a baseline accuracy for evaluating the TfidfVectorizer, check whether its results are better or worse than those of the CountVectorizer.
If you post some code and a sample of your dataset we can help you with that, otherwise this should be enough.

What's a good measure for classifying text documents?

I have written an application that measures text importance. It takes a text article, splits it into words, drops stopwords, performs stemming, and counts word-frequency and document-frequency. Word-frequency is a measure that counts how many times the given word appeared in all documents, and document-frequency is a measure that counts how many documents the given word appeared.
Here's an example with two text articles:
Article I) "A fox jumps over another fox."
Article II) "A hunter saw a fox."
Article I gets split into words (afters stemming and dropping stopwords):
["fox", "jump", "another", "fox"].
Article II gets split into words:
["hunter", "see", "fox"].
These two articles produce the following word-frequency and document-frequency counters:
fox (word-frequency: 3, document-frequency: 2)
jump (word-frequency: 1, document-frequency: 1)
another (word-frequency: 1, document-frequency: 1)
hunter (word-frequency: 1, document-frequency: 1)
see (word-frequency: 1, document-frequency: 1)
Given a new text article, how do I measure how similar this article is to previous articles?
I've read about df-idf measure but it doesn't apply here as I'm dropping stopwords, so words like "a" and "the" don't appear in the counters.
For example, I have a new text article that says "hunters love foxes", how do I come up with a measure that says this article is pretty similar to ones previously seen?
Another example, I have a new text article that says "deer are funny", then this one is a totally new article and similarity should be 0.
I imagine I somehow need to sum word-frequency and document-frequency counter values but what's a good formula to use?
A standard solution is to apply the Naive Bayes classifier which estimates the posterior probability of a class C given a document D, denoted as P(C=k|D) (for a binary classification problem, k=0 and 1).
This is estimated by computing the priors from a training set of class labeled documents, where given a document D we know its class C.
P(C|D) = P(D|C) * P(D) (1)
Naive Bayes assumes that terms are independent, in which case you can write P(D|C) as
P(D|C) = \prod_{t \in D} P(t|C) (2)
P(t|C) can simply be computed by counting how many times does a term occur in a given class, e.g. you expect that the word football will occur a large number of times in documents belonging to the class (category) sports.
When it comes to the other factor P(D), you can estimate it by counting how many labeled documents are given from each class, may be you have more sports articles than finance ones, which makes you believe that there is a higher likelihood of an unseen document to be classified into the sports category.
It is very easy to incorporate factors, such as term importance (idf), or term dependence into Equation (1). For idf, you add it as a term sampling event from the collection (irrespective of the class).
For term dependence, you have to plugin probabilities of the form P(u|C)*P(u|t), which means that you sample a different term u and change (transform) it to t.
Standard implementations of Naive Bayes classifier can be found in the Stanford NLP package, Weka and Scipy among many others.
It seems that you are trying to answer several related questions:
How to measure similarity between documents A and B? (Metric learning)
How to measure how unusual document C is, compared to some collection of documents? (Anomaly detection)
How to split a collection of documents into groups of similar ones? (Clustering)
How to predict to which class a document belongs? (Classification)
All of these problems are normally solved in 2 steps:
Extract the features: Document --> Representation (usually a numeric vector)
Apply the model: Representation --> Result (usually a single number)
There are lots of options for both feature engineering and modeling. Here are just a few.
Feature extraction
Bag of words: Document --> number of occurences of each individual word (that is, term frequencies). This is the basic option, but not the only one.
Bag of n-grams (on word-level or character-level): co-occurence of several tokens is taken into account.
Bag of words + grammatic features (e.g. POS tags)
Bag of word embeddings (learned by an external model, e.g. word2vec). You can use embedding as a sequence or take their weighted average.
Whatever you can invent (e.g. rules based on dictionary lookup)...
Features may be preprocessed in order to decrease relative amount of noise in them. Some options for preprocessing are:
dividing by IDF, if you don't have a hard list of stop words or believe that words might be more or less "stoppy"
normalizing each column (e.g. word count) to have zero mean and unit variance
taking logs of word counts to reduce noise
normalizing each row to have L2 norm equal to 1
You cannot know in advance which option(s) is(are) best for your specific application - you have to do experiments.
Now you can build the ML model. Each of 4 problems has its own good solutions.
For classification, the best studied problem, you can use multiple kinds of models, including Naive Bayes, k-nearest-neighbors, logistic regression, SVM, decision trees and neural networks. Again, you cannot know in advance which would perform best.
Most of these models can use almost any kind of features. However, KNN and kernel-based SVM require your features to have special structure: representations of documents of one class should be close to each other in sense of Euclidean distance metric. This sometimes can be achieved by simple linear and/or logarithmic normalization (see above). More difficult cases require non-linear transformations, which in principle may be learned by neural networks. Learning of these transformations is something people call metric learning, and in general it is an problem which is not yet solved.
The most conventional distance metric is indeed Euclidean. However, other distance metrics are possible (e.g. manhattan distance), or different approaches, not based on vector representations of texts. For example, you can try to calculate Levenstein distance between texts, based on count of number of operations needed to transform one text to another. Or you can calculate "word mover distance" - the sum of distances of word pairs with closest embeddings.
For clustering, basic options are K-means and DBScan. Both these models require your feature space have this Euclidean property.
For anomaly detection you can use density estimations, which are produced by various probabilistic algorithms: classification (e.g. naive Bayes or neural networks), clustering (e.g. mixture of gaussian models), or other unsupervised methods (e.g. probabilistic PCA). For texts, you can exploit the sequential language structure, estimating probabilitiy of each word conditional on the previous words (using n-grams or convolutional/recurrent neural nets) - this is called language models, and it is usually more efficient than bag-of-word assumption of Naive Bayes, which ignores word order. Several language models (one for each class) may be combined into one classifier.
Whatever problem you solve, it is strongly recommended to have a good test set with the known "ground truth": which documents are close to each other, or belong to the same class, or are (un)usual. With this set, you can evaluate different approaches to feature engineering and modelling, and choose the best one.
If you don't have resourses or willingness to do multiple experiments, I would recommend to choose one of the following approaches to evaluate similarity between texts:
word counts + idf normalization + L2 normalization (equivalent to the solution of #mcoav) + Euclidean distance
mean word2vec embedding over all words in text (the embedding dictionary may be googled up and downloaded) + Euclidean distance
Based on one of these representations, you can build models for the other problems - e.g. KNN for classifications or k-means for clustering.
I would suggest tf-idf and cosine similarity.
You can still use tf-idf if you drop out stop-words. It is even probable that whether you include stop-words or not would not make such a difference: the Inverse Document Frequency measure automatically downweighs stop-words since they are very frequent and appear in most documents.
If your new document is entirely made of unknown terms, the cosine similarity will be 0 with every known document.
When I search on df-idf I find nothing.
tf-idf with cosine similarity is very accepted and common practice
Filtering out stop words does not break it. For common words idf gives them low weight anyway.
tf-idf is used by Lucene.
Don't get why you want to reinvent the wheel here.
Don't get why you think the sum of df idf is a similarity measure.
For classification do you have some predefined classes and sample documents to learn from? If so can use Naive Bayes. With tf-idf.
If you don't have predefined classes you can use k means clustering. With tf-idf.
It depend a lot on your knowledge of the corpus and classification objective. In like litigation support documents produced to you, you have and no knowledge of. In Enron they used names of raptors for a lot of the bad stuff and no way you would know that up front. k means lets the documents find their own clusters.
Stemming does not always yield better classification. If you later want to highlight the hits it makes that very complex and the stem will not be the length of the word.
Have you evaluated sent2vec or doc2vec approaches? You can play around with the vectors to see how close the sentences are. Just an idea. Not a verified solution to your question.
While in English a word alone may be enough, it isn't the case in some other more complex languages.
A word has many meanings, and many different uses cases. One text can talk about the same things while using fews to none matching words.
You need to find the most important words in a text. Then you need to catch their possible synonyms.
For that, the following api can help. It is doable to create something similar with some dictionaries.
synonyms("complex")
function synonyms(me){
var url = 'https://api.datamuse.com/words?ml=' + me;
fetch(url).then(v => v.json()).then((function(v){
syn = JSON.stringify(v)
syn = JSON.parse(syn)
for(var k in syn){
document.body.innerHTML += "<span>"+syn[k].word+"</span> "
}
})
)
}
From there comparing arrays will give much more accuracy, much less false positive.
A sufficient solution, in a possibly similar task:
Use of a binary bag-of-word (BOW) approach for the vector representation (frequent words aren't higher weighted than seldom words), rather than a real TF approach
The embedding "word2vec" approach, is sensitive to sequence and distances effects. It might make - depending on your hyper-parameters - a difference between 'a hunter saw a fox' and 'a fox saw a jumping hunter' ... so you have to decide, if this means adding noise to your task - or, alternatively, to use it as an averaged vector only, over all of your text
Extract high within-sentence-correlation words ( e.g., by using variables- mean-normalized- cosine-similaritities )
Second Step: Use this list of high-correlated words, as a positive list, i.e. as new vocab for an new binary vectorizer
This isolated meaningful words for the 2nd step cosine comparisons - in my case, even for rather small amounts of training texts

Incrementally Trainable Entity Recognition Classifier

I'm doing some semantic-web/nlp research, and I have a set of sparse records, containing a mix of numeric and non-numeric data, representing entities labeled with various features extracted from simple English sentences.
e.g.
uid|features
87w39423|speaker=432, session=43242, sentence=34, obj_called=bob,favorite_color_is=blue
4535k3l535|speaker=512, session=2384, sentence=7, obj_called=tree,isa=plant,located_on=wilson_street
23432424|speaker=997, session=8945305, sentence=32, obj_called=salty,isa=cat,eats=mice
09834502|speaker=876, session=43242, sentence=56, obj_called=the monkey,ate=the banana
928374923|speaker=876, session=43242, sentence=57, obj_called=it,was=delicious
294234234|speaker=876, session=43243, sentence=58, obj_called=the monkey,ate=the banana
sd09f8098|speaker=876, session=43243, sentence=59, obj_called=it,was=hungry
...
A single entity may appear more than once (but with a different UID each time), and may have overlapping features with its other occurrences. A second data set represents which of the above UIDs are definitely the same.
e.g.
uid|sameas
87w39423|234k2j,234l24jlsd,dsdf9887s
4535k3l535|09d8fgdg0d9,l2jk34kl,sd9f08sf
23432424|io43po5,2l3jk42,sdf90s8df
09834502|294234234,sd09f8098
...
What algorithm(s) would I use to incrementally train a classifier that could take a set of features, and instantly recommend the N most similar UIDs and probability of whether or not those UIDs actually represent the same entity? Optionally, I'd also like to get a recommendation of missing features to populate and then re-classify to get a more certain matches.
I researched traditional approximate nearest neighbor algorithms. such as FLANN and ANN, and I don't think these would be appropriate since they're not trainable (in a supervised learning sense) nor are they typically designed for sparse non-numeric input.
As a very naive first-attempt, I was thinking about using a naive bayesian classifier, by converting each SameAs relation into a set of training samples. So, for each entity A with B sameas relations, I would iterate over each and train the classifier like:
classifier = Classifier()
for entity,sameas_entities in sameas_dataset:
entity_features = get_features(entity)
for other_entity in sameas_entities:
other_entity_features = get_features(other_entity)
classifier.train(cls=entity, ['left_'+f for f in entity_features] + ['right_'+f for f in other_entity_features])
classifier.train(cls=other_entity, ['left_'+f for f in other_entity_features] + ['right_'+f for f in entity_features])
And then use it like:
>>> print classifier.findSameAs(dict(speaker=997, session=8945305, sentence=32, obj_called='salty',isa='cat',eats='mice'), n=7)
[(1.0, '23432424'),(0.999, 'io43po5', (1.0, '2l3jk42'), (1.0, 'sdf90s8df'), (0.76, 'jerwljk'), (0.34, 'rlekwj32424'), (0.08, '09843jlk')]
>>> print classifier.findSameAs(dict(isa='cat',eats='mice'), n=7)
[(0.09, '23432424'), (0.06, 'jerwljk'), (0.03, 'rlekwj32424'), (0.001, '09843jlk')]
>>> print classifier.findMissingFeatures(dict(isa='cat',eats='mice'), n=4)
['obj_called','has_fur','has_claws','lives_at_zoo']
How viable is this approach? The initial batch training would be horribly slow, at least O(N^2), but incremental training support would allow updates to happen more quickly.
What are better approaches?
I think this is more of a clustering than a classification problem. Your entities are data points and the sameas data is a mapping of entities to clusters. In this case, clusters are the distinct 'things' your entities refer to.
You might want to take a look at semi-supervised clustering. A brief google search turned up the paper Active Semi-Supervision for Pairwise Constrained Clustering which gives pseudocode for an algorithm that is incremental/active and uses supervision in the sense that it takes training data indicating which entities are or are not in the same cluster. You could derive this easily from your sameas data, assuming that - for example - uids 87w39423 and 4535k3l535 are definitely distinct things.
However, to get this to work you need to come up with a distance metric based on the features in the data. You have a lot of options here, for example you could use a simple Hamming distance on the features, but the choice of metric function here is a little bit arbitrary. I'm not aware of any good ways of choosing the metric, but perhaps you have already looked into this when you were considering nearest neighbour algorithms.
You can come up with confidence scores using the distance metric from the centres of the clusters. If you want an actual probability of membership then you would want to use a probabilistic clustering model, like a Gaussian mixture model. There's quite a lot of software to do Gaussian mixture modelling, I don't know of any that is semi-supervised or incremental.
There may be other suitable approaches if the question you wanted to answer was something like "given an entity, which other entities are likely to refer to the same thing?", but I don't think that is what you are after.
You may want to take a look at this method:
"Large Scale Online Learning of Image Similarity Through Ranking" Gal Chechik, Varun Sharma, Uri Shalit and Samy Bengio, Journal of Machine Learning Research (2010).
[PDF] [Project homepage]
More thoughts:
What do you mean by 'entity'? Is entity the thing that is referred by 'obj_called'? Do you use the content of 'obj_called' to match different entities, e.g. 'John' is similar to 'John Doe'? Do you use proximity between sentences to indicate similar entities? What is the greater goal (task) of the mapping?

Resources