It appears that the simplest, naivest way to do basic sentiment analysis is with a Bayesian classifier (confirmed by what I'm finding here on SO). Any counter-arguments or other suggestions?
A Bayesian classifier with a bag of words representation is the simplest statistical method. You can get significantly better results by moving to more advanced classifiers and feature representation, at the cost of more complexity.
Statistical methods aren't the only game in town. Rule based methods that have more understanding of the structure of the text are the other main option. From what I have seen, these don't actually perform as well as statistical methods.
I recommend Manning and Schütze's Foundations of Statistical Natural Language Processing chapter 16, Text Categorization.
I can't think of a simpler, more naive way to do Sentiment Analysis, but you might consider using a Support Vector Machine instead of Naive Bayes (in some machine learning toolkits, this can be a drop-in replacement). Have a look at "Thumbs up? Sentiment Classification using Machine Learning Techniques" by Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan which was one of the earliest papers on these techniques, and gives a good table of accuracy results on a family of related techniques, none of which are any more complicated (from a client perspective) than any of the others.
Building upon the answer provided by Ken above, there is another paper
"Sentiment analysis using support vector machines with diverse information sources" by Tony and Niger,
which looks at assigning more features than just a bag of words used by Pang and Lee. Here, they leverage wordnet to determine semantic differentiation of adjectives, and proximity of the sentiment towards the topic in the text, as additional features for SVM. They show better results than previous attempts to classify text based on sentiment.
Related
Currently, I'm working on a project where I need to extract the relevant aspects used in positive and negative reviews in real time.
For the notions of more negative and positive, it will be a question of contextualizing the word. Distinguish between a word that sounds positive in a negative context (consider irony).
Here is an example:
Very nice welcome!!! We ate very well with traditional dishes as at home, the quality but also the quantity are in appointment!!!*
Positive aspects: welcome, traditional dishes, quality, quantity
Can anyone suggest to me some tutorials, papers or ideas about this topic?
Thank you in advance.
This task is called Aspect Based Sentiment Analysis (ABSA). Most popular is the format and dataset specified in the 2014 Semantic Evaluation Workshop (Task 5) and its updated versions in the following years.
Overview of model efficiencies over the years:
https://paperswithcode.com/sota/aspect-based-sentiment-analysis-on-semeval
Good source for ressources and repositories on the topic (some are very advanced but there are some more starter friendly ressources in there too):
https://github.com/ZhengZixiang/ABSAPapers
Just from my general experience in this topic a very powerful starting point that doesn't require advanced knowledge in machine learning model design is to prepare a Dataset (such as the one provided for the SemEval2014 Task) that is in a Token Classification Format and use it to fine-tune a pretrained transformer model such as BERT, RoBERTa or similar. Check out any tutorial on how to do fine-tuning on a token classification model like this one in huggingface. They usually use the popular task of Named Entity Recognition (NER) as the example task but for the ABSA-Task you basically do the same thing but with other labels and a different dataset.
Obviously an even easier approach would be to take more rule-based approaches or combine a rule-based approach with a trained sentiment analysis model/negation detection etc., but I think generally with a rule-based approach you can expect a much inferior performance compared to using state-of-the-art models as transformers.
If you want to go even more advanced than just fine-tuning the pretrained transformer models then check out the second and third link I provided and look at some of the machine learning model designs specifically designed for Aspect Based Sentiment Analysis.
I would like to apply fine-tuning Bert to calculate semantic similarity between sentences.
I search a lot websites, but I almost not found downstream about this.
I just found STS benchmark.
I wonder if I can use STS benchmark dataset to train a fine-tuning bert model, and apply it to my task.
Is it reasonable?
As I know, there are a lot method to calculate similarity including cosine similarity, pearson correlation, manhattan distance, etc.
How choose for semantic similarity?
In addition, if you're after a binary verdict (yes/no for 'semantically similar'), BERT was actually benchmarked on this task, using the MRPC (Microsoft Research Paraphrase Corpus).
The google github repo https://github.com/google-research/bert includes some example calls for this, see --task_name=MRPC in section Sentence (and sentence-pair) classification tasks.
As a general remark ahead, I want to stress that this kind of question might not be considered on-topic on Stackoverflow, see How to ask. There are, however, related sites that might be better for these kinds of questions (no code, theoretical PoV), namely AI Stackexchange, or Cross Validated.
If you look at a rather popular paper in the field by Mueller and Thyagarajan, which is concerned with learning sentence similarity on LSTMs, they use a closely related dataset (the SICK dataset), which is also hosted by the SemEval competition, and ran alongside the STS benchmark in 2014.
Either one of those should be a reasonable set to fine-tune on, but STS has run over multiple years, so the amount of available training data might be larger.
As a great primer on the topic, I can also highly recommend the Medium article by Adrien Sieg (see here, which comes with an accompanied GitHub reference.
For semantic similarity, I would estimate that you are better of with fine-tuning (or training) a neural network, as most classical similarity measures you mentioned have a more prominent focus on the token similarity (and thus, syntactic similarity, although not even that necessarily). Semantic meaning, on the other hand, can sometimes differ wildly on a single word (maybe a negation, or the swapped sentence position of two words), which is difficult to interpret or evaluate with static methods.
I am given a task of classifying a given news text data into one of the following 5 categories - Business, Sports, Entertainment, Tech and Politics
About the data I am using:
Consists of text data labeled as one of the 5 types of news statement (Bcc news data)
I am currently using NLP with nltk module to calculate the frequency distribution of every word in the training data with respect to each category(except the stopwords).
Then I classify the new data by calculating the sum of weights of all the words with respect to each of those 5 categories. The class with the most weight is returned as the output.
Heres the actual code.
This algorithm does predict new data accurately but I am interested to know about some other simple algorithms that I can implement to achieve better results. I have used Naive Bayes algorithm to classify data into two classes (spam or not spam etc) and would like to know how to implement it for multiclass classification if it is a feasible solution.
Thank you.
In classification, and especially in text classification, choosing the right machine learning algorithm often comes after selecting the right features. Features are domain dependent, require knowledge about the data, but good quality leads to better systems quicker than tuning or selecting algorithms and parameters.
In your case you can either go to word embeddings as already said, but you can also design your own custom features that you think will help in discriminating classes (whatever the number of classes is). For instance, how do you think a spam e-mail is often presented ? A lot of mistakes, syntaxic inversion, bad traduction, punctuation, slang words... A lot of possibilities ! Try to think about your case with sport, business, news etc.
You should try some new ways of creating/combining features and then choose the best algorithm. Also, have a look at other weighting methods than term frequencies, like tf-idf.
Since your dealing with words I would propose word embedding, that gives more insights into relationship/meaning of words W.R.T your dataset, thus much better classifications.
If you are looking for other implementations of classification you check my sample codes here , these models from scikit-learn can easily handle multiclasses, take a look here at documentation of scikit-learn.
If you want a framework around these classification that is easy to use you can check out my rasa-nlu, it uses spacy_sklearn model, sample implementation code is here. All you have to do is to prepare the dataset in a given format and just train the model.
if you want more intelligence then you can check out my keras implementation here, it uses CNN for text classification.
Hope this helps.
I am about to start a project where my final goal is to classify short texts into classes: "may be interested in visiting place X" : "not interested or neutral". Place is described by set of keywords (e.g. meals or types of miles like "chinese food"). So ideally I need some approach to model desire of user based on short text analysis - and then classify based on a desire score or desire probability - is there any state-of-the-art in this field ? Thank you
This problem is exactly the same as sentiment analysis of texts. But, instead of the traditional binary classification, you seem to have a "neutral" opinion. State-of-the-art in sentiment analysis is highly domain-dependent. Techniques that have excelled in classifying movies do not perform as well on commercial products, for example.
Additionally, even the feature-selection is highly domain-dependent. For example, unigrams work well for movie review classification, but a combination of unigrams and bigrams perform better for classifying twitter texts.
My best advice is to "play around" with different features. Since you are looking at short texts, twitter is probably a good motivational example. I would start with unigrams and bigrams as my features. The exact algorithm is not very important. SVM usually performs very well with correct parameter tuning. Use a small amount of held-out data for tuning these parameters before experimenting on bigger datasets.
The more interesting portion of this problem is the ranking! A "purity score" has been recently used for this purpose in the following papers (and I'd say they are pretty state-of-the-art):
Sentiment summarization: evaluating and learning user preferences. Lerman, Blair-Goldensohn and McDonald. EACL. 2009.
The viability of web-derived polarity lexicons. Velikovich, Blair-Goldensohn, Hannan and McDonald. NAACL. 2010.
I want to classify text messages into several categories like, "relation building", "coordination", "information sharing", "knowledge sharing" & "conflict resolution". I am using NLTK library to process these data. I would like to know which classifier, in nltk, is better for this particular multi-class classification problem.
I am planning to use Naive Bayes Classification, is it advisable?
Naive Bayes is the simplest and easy to understand classifier and for that reason it's nice to use. Decision Trees with a beam search to find the best classification are not significantly harder to understand and are usually a bit better. MaxEnt and SVM tend be more complex, and SVM requires some tuning to get right.
Most important is the choice of features + the amount/quality of data you provide!
With your problem, I would focus first on ensuring you have a good training/testing dataset and also choose good features. Since you are asking this question you haven't had much experience with machine learning for NLP, so I'd say start of easy with Naive Bayes as it doesn't use complex features- you can just tokenize and count word occurrences.
EDIT:
The question How do you find the subject of a sentence? and my answer are also worth looking at.
Yes, Training a Naive Bayes Classifier for each category and then labeling each message to a class based on which Classifier provides the highest score is a standard first approach to problems like this. There are more sophisticated single class classifier algorithms which you could substitute in for Naive Bayes if you find performance inadequate, such as a Support Vector Machine ( Which I believe is available in NLTK via a Weka plug in, but not positive). Unless you can think of anything specific in this problem domain that would make Naieve Bayes especially unsuitable, its ofen the go-to "first try" for a lot of projects.
The other NLTK classifier I would consider trying would be MaxEnt as I believe it natively handles multiclass classification. (Though the multiple binary classifer approach is very standard and common as well). In any case the most important thing is to collect a very large corpus of properly tagged text messages.
If by "Text Messages" you are referring to actual cell phone text messages these tend to be very short and the language is very informal and varied, I think feature selection may end up being a larger factor in determining accuracy than classifier choice for you. For example, using a Stemmer or Lemmatizer that understands common abbreviations and idioms used, tagging part of speech or chunking , entity extraction, extracting probably relationships between terms may provide more bang than using more complex classifiers.
This paper talks about classifying Facebook status messages based on sentiment, which has some of the same issues, and may provide some insights into this. The links is to a google cache because I'm having problems w/ the original site:
http://docs.google.com/viewer?a=v&q=cache:_AeBYp6i1ooJ:nlp.stanford.edu/courses/cs224n/2010/reports/ssoriajr-kanej.pdf+maxent+classifier+multiple+classes&hl=en&gl=us&pid=bl&srcid=ADGEESi-eZHTZCQPo7AlcnaFdUws9nSN1P6X0BVmHjtlpKYGQnj7dtyHmXLSONa9Q9ziAQjliJnR8yD1Z-0WIpOjcmYbWO2zcB6z4RzkIhYI_Dfzx2WqU4jy2Le4wrEQv0yZp_QZyHQN&sig=AHIEtbQN4J_XciVhVI60oyrPb4164u681w&pli=1