Text Classification + NLP + Data-mining + Data Science: Should I do stop word removal and stemming before applying tf-idf?

Text Classification + NLP + Data-mining + Data Science: Should I do stop word removal and stemming before applying tf-idf? - nlp

I am working on a text classification problem. The problem is explained below:
I have a dataset of events which contains three columns - name of the event, description of the event, category of the event. There are about 32 categories in the dataset, such as, travel, sport, education, business etc. I have to classify each event to a category depending on its name and description.
What I understood is this particular task of classification is highly dependent on keywords, rather than, semantics. I am giving you two examples:
If the word 'football' is found either in the name or description or in both, it is highly likely that the event is about sport.
If the word 'trekking' is found either in the name or description or in both, it is highly likely that the event is about travel.
We are not considering multiple categories for an event(however, that's a plan for future !! )
I hope applying tf-idf before Multinomial Naive Bayes would lead to decent result for this problem. My question is:
Should I do stop word removal and stemming before applying tf-idf or should I apply tf-idf just on raw text? Here text means entries in name of event and description columns.

The question is too generic and you are not providing samples of the dataset, code, and not even indicating the language you are using. To this regard, I will presume that you are using English, since the two words that you are providing as an example are "football" and "trekking". The answer will however necessarily be generic.
Should I do stop word removal
Yes. Have a look at this to see the most frequent words in the English language. As you can see they have no semantic meaning, and thus would not contribute to solving the classification task that you have proposed. if stopwords is a list containing stopwords, the parameter stop_words=stopwords passed to the CountVectorizer or TfidfVectorizer constructor will automatically exclude the stopwords when invoking the .fit_transform() method.
Should I do stemming
It depends. Languages other than English, whose grammar rules allow for a big number of possible prefixes-suffixes, normally require stemming when performing classification task, in order to reach any useful result. The English language however has very poor grammar rules, and thus you can often get away without stemming/lemmatization. You should check the results obtained against the desired accuracy first, and if it is insufficient, try adding a stemming/lemmatization step in the preprocessing of your data. Stemming is a computationally expensive process for large corpora, and I personally use it only for languages that require it.
I hope applying tf-idf before Multinomial Naive Bayes would lead to decent result for this problem
Careful with this. While tf-idf in practice works with Naive Bayesian classifiers, this is not the way that specific classifier is meant to be used. From the documentation,
The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work. It is in your best interest to tackle the classification task with CountVectorizer first and score it, and after you have a baseline accuracy for evaluating the TfidfVectorizer, check whether its results are better or worse than those of the CountVectorizer.
If you post some code and a sample of your dataset we can help you with that, otherwise this should be enough.

Related

Unsupervised sentiment Analysis using doc2vec

Folks,
I have searched Google for different type of papers/blogs/tutorials etc but haven't found anything helpful. I would appreciate if anyone can help me. Please note that I am not asking for code step-by-step but rather an idea/blog/paper or some tutorial.
Here's my problem statement:
Just like sentiment analysis is used for identifying positive and
negative tone of a sentence, I want to find whether a sentence is
forward-looking (future outlook) statement or not.
I do not want to use bag of words approach to sum up the number of forward-looking words/phrases such as "going forward", "in near future" or "In 5 years from now" etc. I am not sure if word2vec or doc2vec can be used. Please enlighten me.
Thanks.

It seems what you are interested in doing is finding temporal statements in texts.
Not sure of your final output, but let's assume you want to find temporal phrases or sentences which contain them.
One methodology could be the following:
Create list of temporal terms [days, years, months, now, later]
Pick only sentences with key terms
Use sentences in doc2vec model
Infer vector and use distance metric for new sentence
GMM Cluster + Limit
Distance from average
Another methodology could be:
Create list of temporal terms [days, years, months, now, later]
Do Bigram and Trigram collocation extraction
Keep relevant collocations with temporal terms
Use relevant collocations in a kind of bag-of-collocations approach
Matched binary feature vectors for relevant collocations
Train classifier to recognise higher level text
This sounds like a good case for a Bootstrapping approach if you have large amounts of texts.
Both are semi-supervised really, since there is some need for finding initial temporal terms, but even that could be automated using a word2vec scheme and bootstrapping

What's a good measure for classifying text documents?

I have written an application that measures text importance. It takes a text article, splits it into words, drops stopwords, performs stemming, and counts word-frequency and document-frequency. Word-frequency is a measure that counts how many times the given word appeared in all documents, and document-frequency is a measure that counts how many documents the given word appeared.
Here's an example with two text articles:
Article I) "A fox jumps over another fox."
Article II) "A hunter saw a fox."
Article I gets split into words (afters stemming and dropping stopwords):
["fox", "jump", "another", "fox"].
Article II gets split into words:
["hunter", "see", "fox"].
These two articles produce the following word-frequency and document-frequency counters:
fox (word-frequency: 3, document-frequency: 2)
jump (word-frequency: 1, document-frequency: 1)
another (word-frequency: 1, document-frequency: 1)
hunter (word-frequency: 1, document-frequency: 1)
see (word-frequency: 1, document-frequency: 1)
Given a new text article, how do I measure how similar this article is to previous articles?
I've read about df-idf measure but it doesn't apply here as I'm dropping stopwords, so words like "a" and "the" don't appear in the counters.
For example, I have a new text article that says "hunters love foxes", how do I come up with a measure that says this article is pretty similar to ones previously seen?
Another example, I have a new text article that says "deer are funny", then this one is a totally new article and similarity should be 0.
I imagine I somehow need to sum word-frequency and document-frequency counter values but what's a good formula to use?

A standard solution is to apply the Naive Bayes classifier which estimates the posterior probability of a class C given a document D, denoted as P(C=k|D) (for a binary classification problem, k=0 and 1).
This is estimated by computing the priors from a training set of class labeled documents, where given a document D we know its class C.
P(C|D) = P(D|C) * P(D) (1)
Naive Bayes assumes that terms are independent, in which case you can write P(D|C) as
P(D|C) = \prod_{t \in D} P(t|C) (2)
P(t|C) can simply be computed by counting how many times does a term occur in a given class, e.g. you expect that the word football will occur a large number of times in documents belonging to the class (category) sports.
When it comes to the other factor P(D), you can estimate it by counting how many labeled documents are given from each class, may be you have more sports articles than finance ones, which makes you believe that there is a higher likelihood of an unseen document to be classified into the sports category.
It is very easy to incorporate factors, such as term importance (idf), or term dependence into Equation (1). For idf, you add it as a term sampling event from the collection (irrespective of the class).
For term dependence, you have to plugin probabilities of the form P(u|C)*P(u|t), which means that you sample a different term u and change (transform) it to t.
Standard implementations of Naive Bayes classifier can be found in the Stanford NLP package, Weka and Scipy among many others.

It seems that you are trying to answer several related questions:
How to measure similarity between documents A and B? (Metric learning)
How to measure how unusual document C is, compared to some collection of documents? (Anomaly detection)
How to split a collection of documents into groups of similar ones? (Clustering)
How to predict to which class a document belongs? (Classification)
All of these problems are normally solved in 2 steps:
Extract the features: Document --> Representation (usually a numeric vector)
Apply the model: Representation --> Result (usually a single number)
There are lots of options for both feature engineering and modeling. Here are just a few.
Feature extraction
Bag of words: Document --> number of occurences of each individual word (that is, term frequencies). This is the basic option, but not the only one.
Bag of n-grams (on word-level or character-level): co-occurence of several tokens is taken into account.
Bag of words + grammatic features (e.g. POS tags)
Bag of word embeddings (learned by an external model, e.g. word2vec). You can use embedding as a sequence or take their weighted average.
Whatever you can invent (e.g. rules based on dictionary lookup)...
Features may be preprocessed in order to decrease relative amount of noise in them. Some options for preprocessing are:
dividing by IDF, if you don't have a hard list of stop words or believe that words might be more or less "stoppy"
normalizing each column (e.g. word count) to have zero mean and unit variance
taking logs of word counts to reduce noise
normalizing each row to have L2 norm equal to 1
You cannot know in advance which option(s) is(are) best for your specific application - you have to do experiments.
Now you can build the ML model. Each of 4 problems has its own good solutions.
For classification, the best studied problem, you can use multiple kinds of models, including Naive Bayes, k-nearest-neighbors, logistic regression, SVM, decision trees and neural networks. Again, you cannot know in advance which would perform best.
Most of these models can use almost any kind of features. However, KNN and kernel-based SVM require your features to have special structure: representations of documents of one class should be close to each other in sense of Euclidean distance metric. This sometimes can be achieved by simple linear and/or logarithmic normalization (see above). More difficult cases require non-linear transformations, which in principle may be learned by neural networks. Learning of these transformations is something people call metric learning, and in general it is an problem which is not yet solved.
The most conventional distance metric is indeed Euclidean. However, other distance metrics are possible (e.g. manhattan distance), or different approaches, not based on vector representations of texts. For example, you can try to calculate Levenstein distance between texts, based on count of number of operations needed to transform one text to another. Or you can calculate "word mover distance" - the sum of distances of word pairs with closest embeddings.
For clustering, basic options are K-means and DBScan. Both these models require your feature space have this Euclidean property.
For anomaly detection you can use density estimations, which are produced by various probabilistic algorithms: classification (e.g. naive Bayes or neural networks), clustering (e.g. mixture of gaussian models), or other unsupervised methods (e.g. probabilistic PCA). For texts, you can exploit the sequential language structure, estimating probabilitiy of each word conditional on the previous words (using n-grams or convolutional/recurrent neural nets) - this is called language models, and it is usually more efficient than bag-of-word assumption of Naive Bayes, which ignores word order. Several language models (one for each class) may be combined into one classifier.
Whatever problem you solve, it is strongly recommended to have a good test set with the known "ground truth": which documents are close to each other, or belong to the same class, or are (un)usual. With this set, you can evaluate different approaches to feature engineering and modelling, and choose the best one.
If you don't have resourses or willingness to do multiple experiments, I would recommend to choose one of the following approaches to evaluate similarity between texts:
word counts + idf normalization + L2 normalization (equivalent to the solution of #mcoav) + Euclidean distance
mean word2vec embedding over all words in text (the embedding dictionary may be googled up and downloaded) + Euclidean distance
Based on one of these representations, you can build models for the other problems - e.g. KNN for classifications or k-means for clustering.

I would suggest tf-idf and cosine similarity.
You can still use tf-idf if you drop out stop-words. It is even probable that whether you include stop-words or not would not make such a difference: the Inverse Document Frequency measure automatically downweighs stop-words since they are very frequent and appear in most documents.
If your new document is entirely made of unknown terms, the cosine similarity will be 0 with every known document.

When I search on df-idf I find nothing.
tf-idf with cosine similarity is very accepted and common practice
Filtering out stop words does not break it. For common words idf gives them low weight anyway.
tf-idf is used by Lucene.
Don't get why you want to reinvent the wheel here.
Don't get why you think the sum of df idf is a similarity measure.
For classification do you have some predefined classes and sample documents to learn from? If so can use Naive Bayes. With tf-idf.
If you don't have predefined classes you can use k means clustering. With tf-idf.
It depend a lot on your knowledge of the corpus and classification objective. In like litigation support documents produced to you, you have and no knowledge of. In Enron they used names of raptors for a lot of the bad stuff and no way you would know that up front. k means lets the documents find their own clusters.
Stemming does not always yield better classification. If you later want to highlight the hits it makes that very complex and the stem will not be the length of the word.

Have you evaluated sent2vec or doc2vec approaches? You can play around with the vectors to see how close the sentences are. Just an idea. Not a verified solution to your question.

While in English a word alone may be enough, it isn't the case in some other more complex languages.
A word has many meanings, and many different uses cases. One text can talk about the same things while using fews to none matching words.
You need to find the most important words in a text. Then you need to catch their possible synonyms.
For that, the following api can help. It is doable to create something similar with some dictionaries.
synonyms("complex")
function synonyms(me){
var url = 'https://api.datamuse.com/words?ml=' + me;
fetch(url).then(v => v.json()).then((function(v){
syn = JSON.stringify(v)
syn = JSON.parse(syn)
for(var k in syn){
document.body.innerHTML += "<span>"+syn[k].word+"</span> "
}
})
)
}
From there comparing arrays will give much more accuracy, much less false positive.

A sufficient solution, in a possibly similar task:
Use of a binary bag-of-word (BOW) approach for the vector representation (frequent words aren't higher weighted than seldom words), rather than a real TF approach
The embedding "word2vec" approach, is sensitive to sequence and distances effects. It might make - depending on your hyper-parameters - a difference between 'a hunter saw a fox' and 'a fox saw a jumping hunter' ... so you have to decide, if this means adding noise to your task - or, alternatively, to use it as an averaged vector only, over all of your text
Extract high within-sentence-correlation words ( e.g., by using variables- mean-normalized- cosine-similaritities )
Second Step: Use this list of high-correlated words, as a positive list, i.e. as new vocab for an new binary vectorizer
This isolated meaningful words for the 2nd step cosine comparisons - in my case, even for rather small amounts of training texts

Multiclass text classification with python and nltk

I am given a task of classifying a given news text data into one of the following 5 categories - Business, Sports, Entertainment, Tech and Politics
About the data I am using:
Consists of text data labeled as one of the 5 types of news statement (Bcc news data)
I am currently using NLP with nltk module to calculate the frequency distribution of every word in the training data with respect to each category(except the stopwords).
Then I classify the new data by calculating the sum of weights of all the words with respect to each of those 5 categories. The class with the most weight is returned as the output.
Heres the actual code.
This algorithm does predict new data accurately but I am interested to know about some other simple algorithms that I can implement to achieve better results. I have used Naive Bayes algorithm to classify data into two classes (spam or not spam etc) and would like to know how to implement it for multiclass classification if it is a feasible solution.
Thank you.

In classification, and especially in text classification, choosing the right machine learning algorithm often comes after selecting the right features. Features are domain dependent, require knowledge about the data, but good quality leads to better systems quicker than tuning or selecting algorithms and parameters.
In your case you can either go to word embeddings as already said, but you can also design your own custom features that you think will help in discriminating classes (whatever the number of classes is). For instance, how do you think a spam e-mail is often presented ? A lot of mistakes, syntaxic inversion, bad traduction, punctuation, slang words... A lot of possibilities ! Try to think about your case with sport, business, news etc.
You should try some new ways of creating/combining features and then choose the best algorithm. Also, have a look at other weighting methods than term frequencies, like tf-idf.

Since your dealing with words I would propose word embedding, that gives more insights into relationship/meaning of words W.R.T your dataset, thus much better classifications.
If you are looking for other implementations of classification you check my sample codes here , these models from scikit-learn can easily handle multiclasses, take a look here at documentation of scikit-learn.
If you want a framework around these classification that is easy to use you can check out my rasa-nlu, it uses spacy_sklearn model, sample implementation code is here. All you have to do is to prepare the dataset in a given format and just train the model.
if you want more intelligence then you can check out my keras implementation here, it uses CNN for text classification.
Hope this helps.

Best for resume, document matching

I have used three different ways to calculate the matching between the resume and the job description. Can anyone tell me that what method is the best and why?
I used NLTK for keyword extraction and then RAKE for
keywords/keyphrase scoring, then I applied cosine similarity.
Scikit for keywords extraction, tf-idf and cosine similarity
calculation.
Gensim library with LSA/LSI model to extract keywords and calculate
cosine similarity between documents and query.

Nobody here can give you the answer. The only way to decide which method works better is to have one or more humans independently match lots and lots of resumes and job descriptions, and compare what they do to what your algorithms do. Ideally you'd have a dataset of already matched resumes and job descriptions (companies must do this kind of thing when people apply), because it takes a lot of work to create a sufficiently large dataset.
Next time you take on this kind of project, start by considering how you are going to evaluate the performance of the solution you'll put together.

As already mentioned in answers, try ti use Doc2Vec.
Seems using Doc2Vec from Gensim on both corpora (CVs and job descriptions) separately and then using cosine similarity between the two vectors is the easiest flow to work. It works better than others on documents which are not similar in form and words content but similar in context and sematics, so merely keywords would not help much here.
Then you can try to train CNN on the corpus of pairs of matched CV&JD with labels like yes/no if available and use it to qulaify CVs/resumees against job descriptions.
Basically I'm going to try these aproaches in my pretty much the same task, pls see https://datascience.stackexchange.com/questions/22421/is-there-an-algorithm-or-nn-to-match-two-documents-basically-not-closely-simila

Since its highly likely that job description and resume content can be different, you should think from semantics point of view. One thing possible you can do is use some domain knowledge. But its pretty difficult to gain domain knowledge for a variety of job types. Researchers sometimes use dictionary to augment the similarity matching between documents.
Researchers are using deep neural networks to capture both syntactic and semantic structure of documents. You can use doc2Vec to compare two documents. Gensim can produce doc2Vec representation for you. I believe that will give better results compared to keyword extraction and similarity computation. You can build your own neural network model to train on job descriptions and resumes. I guess neural networks will be effective for your work.

Document classification using LSA/SVD

I am trying to do document classification using Support Vector Machines (SVM). The documents I have are collection of emails. I have around 3000 documents to train the SVM classifier and have a test document set of around 700 for which I need classification.
I initially used binary DocumentTermMatrix as the input for SVM training. I got around 81% accuracy for the classification with the test data. DocumentTermMatrix was used after removing several stopwords.
Since I wanted to improve the accuracy of this model, I tried using LSA/SVD based dimensional reduction and use the resulting reduced factors as input to the classification model (I tried with 20, 50, 100 and 200 singular values from the original bag of ~ 3000 words). The performance of the classification worsened in each case. (Another reason for using LSA/SVD was to overcome memory issues with one of the response variable that had 65 levels).
Can someone provide some pointers on how to improve the performance of LSA/SVD classification? I realize this is general question without any specific data or code but would appreciate some inputs from the experts on where to start the debugging.
FYI, I am using R for doing the text preprocessing (packages: tm, snowball,lsa) and building classification models (package: kernelsvm)
Thank you.

Here's some general advice - nothing specific to LSA, but it might help improving the results nonetheless.
'binary documentMatrix' seems to imply your data is represented by binary values, i.e. 1 for a term existing in a document, and 0 for non-existing term; moving to other scoring scheme
(e.g. tf/idf) might lead to better results.
LSA is a good metric for dimensional reduction in some cases, but less so in others. So depending in the exact nature of your data, it might be a good idea to consider additional methods, e.g. Infogain.
If the main incentive for reducing the dimensionality is the one parameter with 65 levels, maybe treating this parameter specifically, e.g. by some form of quantization, would lead to a better tradeoff?

This might not be the best tailored answer. Hope these suggestions may help.
Maybe you could use lemmatization over stemming to reduce unacceptable outcomes.
Short and dense: http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
The goal of both stemming and lemmatization is to reduce inflectional forms and
sometimes derivationally related forms of a word to a common base form.
However, the two words differ in their flavor. Stemming usually refers to a crude
heuristic process that chops off the ends of words in the hope of achieving this
goal correctly most of the time, and often includes the removal of derivational
affixes. Lemmatization usually refers to doing things properly with the use of a
vocabulary and morphological analysis of words, normally aiming to remove
inflectional endings only and to return the base or dictionary form of a word,
which is known as the lemma.
One instance:
go,goes,going ->Lemma: go,go,go ||Stemming: go, goe, go
And use some predefined set of rules; such that short term words are generalized. For instance:
I'am -> I am
should't -> should not
can't -> can not
How to deal with parentheses inside a sentence.
This is a dog(Its name is doggy)
Text inside parentheses often referred to alias names of the entities mentioned. You can either removed them or do correference analysis and treat it as a new sentence.

Try to use Local LSA, which can improve the classification process compared to Global LSA. In addition, LSA's power depends entirely on its parameters, so try to tweak parameters (start with 1, then 2 or more) and compare results to enhance the performance.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string