I am trying to do some classification on customer emails.
Is the email happy or sad (sentiment analysis)
Is the email related to billing or not.
I am using Python3 and think I have to use nltk and scikit
NLTK - will help understand and read the text I beleive
scikit - will do the classification (happy, sad and billing or not)
Training data set 1: A few phrases...anywhere from one word to a sentence with 5 to 6 words.
(1 being happy and 0 being not happy)...a few examples below
Apprecaite the help..1
great job..1
Awesome..1
terrible..0
confusing...0
slow down...0
Training data set 2: a few phrases indicating billing related question..(few examples below)
question on my bill
billing fee
my bill is too high
payment rejected
Now this seems to be straight forward from a concept stand point
where can I find some basic code, that will tell me
how I can use my own training data
how I can load the email text as input and spit out an answer happy or sad...and billing or not.
Regarding your data sets, your approach is nearly lexicon-based as the items contains very few words.
For billing, the lexicon-based approach should be a good idea. You should give importance to the subjects of the emails.
For sentiment analysis you have two options:
Machine learning: In this case you should use a bigger data set (in my view, each item should be a full email). You can implement a Naive Bayes classifier following this tutorial.
Lexicon-based approach: There are several lexicons for sentiment analysis e.g. SentiWordNet (downloadable from nltk.download()), MPQA, SentiStrength, WordNet-Affect via WNAffect,... Preprocessings: tokenization (nltk.word_tokenize()) and POS tagging (nltk.pos_tag(text)). You should also think about negation (polarity shifting is a good approach to manage with negation).
Machine Learning provide best results so if you have enough annotated emails it is the good choice.
Related
I have a project where I need to analyze a text to extract some information if the user who post this text need help in something or not, I tried to use sentiment analysis but it didn't work as expected, my idea was to get the negative post and extract the main words in the post and suggest to him some articles about that subject, if there is another way that can help me please post it below and thanks.
for the dataset i useed, it was a dataset for sentiment analyze, but now I found that it's not working and I need a dataset use for this subject.
Please use the NLP methods before processing the sentiment analysis. Use the TFIDF, Word2Vector to create vectors on the given dataset. And them try the sentiment analysis. You may also need glove vector for the conducting analysis.
For this topic I found that this field in machine learning is called "Natural Language Questions" it's a field where machine learning models trained to detect questions in text and suggesting answer for them based on data set you are working with, check this article for more detail.
I'm trying to implement as aspect miner based on consumer reviews in amazon for durable- washing machine, refrigerator. The idea is to output sentiment polarity for aspects instead of the entire sentence. For eg: 'Food was good but service was bad' review must output food to be positive and service to be negative. I read through Richard Socher's paper on RNTN model for fine grained sentiment classifier but I guess I'll need to manually tag sentiment for phrases for a different domain and create my own treebank for better accuracy.
Here's an alternate approach I'd thought of. Could someone pls validate/guide me with your feedback
Break the approach into 2 sub tasks. 1) Identify aspects 2) Identify sentiment
Identify aspects
Use POS tagger to identify all nouns. This should shortlist
potentially all aspects in the reviews.
Use word2vec of these nouns to determine similar nouns and reduce the dataset size
Identify sentiments
Train a CNN or dense net model on reviews with rating 1,2,4,5(ignore
3 as we need data that has polarity)
Breakdown the test set reviews into phrases(eg 'Food was good') and then score them using the above model
Find the aspects identified in the 1st sub task and tag them to
their respective phrases.
I don't know how to answer this question but have a few suggestions:
Take a look at multitask learning in neuralnets literature and try an end2end neuralnet for multiple tasks.
Use pretrained word vectors like w2v or glov as inputs.
Don't rely on pos taggers when you use internet data,
Find a way to represent your name entities and oov in your design.
Don't ignore 3!!
You should annotate some data periodically.
So, this question might be a little naive, but I thought asking the friendly people of Stackoverflow wouldn't hurt.
My current company has been using a third party API for NLP for a while now. We basically URL encode a string and send it over, and they extract certain entities for us (we have a list of entities that we're looking for) and return a json mapping of entity : sentiment. We've recently decided to bring this project in house instead.
I've been studying NLTK, Stanford NLP and lingpipe for the past 2 days now, and can't figure out if I'm basically reinventing the wheel doing this project.
We already have massive tables containing the original unstructured text and another table containing the extracted entities from that text and their sentiment. The entities are single words. For example:
Unstructured text : Now for the bed. It wasn't the best.
Entity : Bed
Sentiment : Negative
I believe that implies we have training data (unstructured text) as well as entity and sentiments. Now how I can go about using this training data on one of the NLP frameworks and getting what we want? No clue. I've sort of got the steps, but not sure:
Tokenize sentences
Tokenize words
Find the noun in the sentence (POS tagging)
Find the sentiment of that sentence.
But that should fail for the case I mentioned above since it talks about the bed in 2 different sentences?
So the question - Does any one know what the best framework would be for accomplishing the above tasks, and any tutorials on the same (Note: I'm not asking for a solution). If you've done this stuff before, is this task too large to take on? I've looked up some commercial APIs but they're absurdly expensive to use (we're a tiny startup).
Thanks stackoverflow!
OpenNLP may also library to look at. At least they have a small tutuorial to train the name finder and to use the document categorizer to do sentiment analysis. To trtain the name finder you have to prepare training data by taging the entities in your text with SGML tags.
http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind.training
NLTK provides a naive NER tagger along with resources. But It doesnt fit into all cases (including finding dates.) But NLTK allows you to modify and customize the NER Tagger according to the requirement. This link might give you some ideas with basic examples on how to customize. Also if you are comfortable with scala and functional programming this is one tool you cannot afford to miss.
Cheers...!
I have discovered spaCy lately and it's just great ! In the link you can find comparative for performance in term of speed and accuracy compared to NLTK, CoreNLP and it does really well !
Though to solve your problem task is not a matter of a framework. You can have two different system, one for NER and one for Sentiment and they can be completely independent. The hype these days is to use neural network and if you are willing too, you can train a recurrent neural network (which has showed best performance for NLP tasks) with attention mechanism to find the entity and the sentiment too.
There are great demo everywhere on the internet, the last two I have read and found interesting are [1] and [2].
Similar to Spacy, TextBlob is another fast and easy package that can accomplish many of these tasks.
I use NLTK, Spacy, and Textblob frequently. If the corpus is simple, generic, and straightforward, Spacy and Textblob work well OOTB. If the corpus is highly customized, domain-specific, messy (incorrect spelling or grammar), etc. I'll use NLTK and spend more time customizing my NLP text processing pipeline with scrubbing, lemmatizing, etc.
NLTK Tutorial: http://www.nltk.org/book/
Spacy Quickstart: https://spacy.io/usage/
Textblob Quickstart: http://textblob.readthedocs.io/en/dev/quickstart.html
I am about to start a project where my final goal is to classify short texts into classes: "may be interested in visiting place X" : "not interested or neutral". Place is described by set of keywords (e.g. meals or types of miles like "chinese food"). So ideally I need some approach to model desire of user based on short text analysis - and then classify based on a desire score or desire probability - is there any state-of-the-art in this field ? Thank you
This problem is exactly the same as sentiment analysis of texts. But, instead of the traditional binary classification, you seem to have a "neutral" opinion. State-of-the-art in sentiment analysis is highly domain-dependent. Techniques that have excelled in classifying movies do not perform as well on commercial products, for example.
Additionally, even the feature-selection is highly domain-dependent. For example, unigrams work well for movie review classification, but a combination of unigrams and bigrams perform better for classifying twitter texts.
My best advice is to "play around" with different features. Since you are looking at short texts, twitter is probably a good motivational example. I would start with unigrams and bigrams as my features. The exact algorithm is not very important. SVM usually performs very well with correct parameter tuning. Use a small amount of held-out data for tuning these parameters before experimenting on bigger datasets.
The more interesting portion of this problem is the ranking! A "purity score" has been recently used for this purpose in the following papers (and I'd say they are pretty state-of-the-art):
Sentiment summarization: evaluating and learning user preferences. Lerman, Blair-Goldensohn and McDonald. EACL. 2009.
The viability of web-derived polarity lexicons. Velikovich, Blair-Goldensohn, Hannan and McDonald. NAACL. 2010.
I am embarking upon a NLP project for sentiment analysis.
I have successfully installed NLTK for python (seems like a great piece of software for this). However,I am having trouble understanding how it can be used to accomplish my task.
Here is my task:
I start with one long piece of data (lets say several hundred tweets on the subject of the UK election from their webservice)
I would like to break this up into sentences (or info no longer than 100 or so chars) (I guess i can just do this in python??)
Then to search through all the sentences for specific instances within that sentence e.g. "David Cameron"
Then I would like to check for positive/negative sentiment in each sentence and count them accordingly
NB: I am not really worried too much about accuracy because my data sets are large and also not worried too much about sarcasm.
Here are the troubles I am having:
All the data sets I can find e.g. the corpus movie review data that comes with NLTK arent in webservice format. It looks like this has had some processing done already. As far as I can see the processing (by stanford) was done with WEKA. Is it not possible for NLTK to do all this on its own? Here all the data sets have already been organised into positive/negative already e.g. polarity dataset http://www.cs.cornell.edu/People/pabo/movie-review-data/ How is this done? (to organise the sentences by sentiment, is it definitely WEKA? or something else?)
I am not sure I understand why WEKA and NLTK would be used together. Seems like they do much the same thing. If im processing the data with WEKA first to find sentiment why would I need NLTK? Is it possible to explain why this might be necessary?
I have found a few scripts that get somewhat near this task, but all are using the same pre-processed data. Is it not possible to process this data myself to find sentiment in sentences rather than using the data samples given in the link?
Any help is much appreciated and will save me much hair!
Cheers Ke
The movie review data has already been marked by humans as being positive or negative (the person who made the review gave the movie a rating which is used to determine polarity). These gold standard labels allow you to train a classifier, which you could then use for other movie reviews. You could train a classifier in NLTK with that data, but applying the results to election tweets might be less accurate than randomly guessing positive or negative. Alternatively, you can go through and label a few thousand tweets yourself as positive or negative and use this as your training set.
For a description of using Naive Bayes for sentiment analysis with NLTK: http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/
Then in that code, instead of using the movie corpus, use your own data to calculate word counts (in the word_feats method).
Why dont you use WSD. Use Disambiguation tool to find senses. and use map polarity to the senses instead of word. In this case you will get a bit more accurate results as compared to word index polarity.