Sources of classified sentiment data?

I'm looking to train a naive Bayes classifier with some new data sources that haven't been used before. I've already looked at the Pang & Lee corpus of IMDB reviews and the MPQA opinion corpus. I'm looking for new web services that fit the following criteria.
Easily Classified - must have a like/dislike or 5 star rating
Readily available
Pertain to new material (less important than the first two)
Here are some samples I have come up with on my own.
Etsy API
Rotten Tomatoes API
Yelp API
Any other suggestions would be much appreciated =)

Pang & Lee's later work, "Opinion Mining and Sentiment Analysis" (2008), has a section on publicly available resources. It has links to those corpora.

Take a look at sentiment140. It has a corpus that you can download and train with. You can easily extend to new tweets.

Related

Simple NLTK sentiment analysis code using Python 3

I am trying to do some classification on customer emails.
Is the email happy or sad (sentiment analysis)
Is the email related to billing or not.
I am using Python 3 and think I have to use NLTK and scikit-learn:
NLTK - will help read and understand the text, I believe
scikit-learn - will do the classification (happy or sad, and billing or not)
Training data set 1: A few phrases...anywhere from one word to a sentence with 5 to 6 words.
(1 being happy and 0 being not happy)...a few examples below
Apprecaite the help..1
great job..1
Awesome..1
terrible..0
confusing...0
slow down...0
Training data set 2: a few phrases indicating billing related question..(few examples below)
question on my bill
billing fee
my bill is too high
payment rejected
Now this seems straightforward from a conceptual standpoint.
Where can I find some basic code that will show me:
how I can use my own training data
how I can load the email text as input and get back an answer: happy or sad, and billing or not.
Regarding your data sets, your approach is nearly lexicon-based, as the items contain very few words.
For billing, a lexicon-based approach should work well. You should also give weight to the subject lines of the emails.
For sentiment analysis you have two options:
Machine learning: In this case you should use a bigger data set (in my view, each item should be a full email). You can implement a Naive Bayes classifier following this tutorial.
Lexicon-based approach: There are several lexicons for sentiment analysis, e.g. SentiWordNet (downloadable via nltk.download()), MPQA, SentiStrength, WordNet-Affect via WNAffect, ... Preprocessing: tokenization (nltk.word_tokenize()) and POS tagging (nltk.pos_tag() on the token list). You should also think about negation (polarity shifting is a good way to handle negation).
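To make the lexicon-based option concrete, here is a minimal sketch of lexicon scoring with a crude form of polarity shifting. It is only an illustration: the LEXICON and NEGATORS tables are made-up toy values (a real system would load SentiWordNet, MPQA or similar), the regex tokenizer is a dependency-free stand-in for nltk.word_tokenize(), and the two-token negation window is an arbitrary assumption.

```python
import re

# Hypothetical toy lexicon: word -> polarity. Not taken from any real resource.
LEXICON = {"great": 1.0, "awesome": 1.0, "appreciate": 1.0, "good": 0.5,
           "terrible": -1.0, "confusing": -0.5, "slow": -0.5}
# Hypothetical list of negation words that trigger polarity shifting.
NEGATORS = {"not", "no", "never", "isn't", "wasn't", "don't", "can't"}


def tokenize(text):
    # Lowercase and keep apostrophes so "wasn't" stays a single token.
    return re.findall(r"[a-z']+", text.lower())


def lexicon_score(text, window=2):
    """Sum word polarities, flipping the sign of the `window` tokens after a negator."""
    score = 0.0
    flip_left = 0  # how many upcoming tokens are still under a negator
    for tok in tokenize(text):
        if tok in NEGATORS:
            flip_left = window
            continue
        polarity = LEXICON.get(tok, 0.0)
        if flip_left > 0:
            polarity = -polarity
            flip_left -= 1
        score += polarity
    return score


def label(text):
    # Map the numeric score to the happy (1) / not happy (0) labels from the question.
    return 1 if lexicon_score(text) > 0 else 0
```

A score above zero is read as happy, anything else as not happy. With such short phrases, coverage of the lexicon matters much more than the scoring rule.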
Machine learning usually gives the best results, so if you have enough annotated emails it is the better choice.
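For the machine learning route, the following is a rough, dependency-free sketch of a multinomial Naive Bayes classifier with add-one smoothing, trained on the phrases from the question (spelling corrected). It is not the code from the tutorial mentioned above, and in practice you would more likely use the classifiers shipped with NLTK or scikit-learn. The NaiveBayes class and its method names are invented for this example, and the non-billing examples are an assumption: the question only lists billing phrases, so the sentiment phrases are reused as negatives.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Simple stand-in tokenizer: lowercase words, apostrophes kept.
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Tiny multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_logp = None, -math.inf
        for label, n_docs in self.class_counts.items():
            logp = math.log(n_docs / total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            for tok in tokenize(text):
                if tok not in self.vocab:
                    continue  # words never seen in training carry no evidence
                count = self.word_counts[label][tok]
                logp += math.log((count + 1) / (total_words + len(self.vocab)))
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label


# Training data set 1 from the question: 1 = happy, 0 = not happy.
sentiment_texts = ["Appreciate the help", "great job", "Awesome",
                   "terrible", "confusing", "slow down"]
sentiment_labels = [1, 1, 1, 0, 0, 0]
sentiment_model = NaiveBayes().fit(sentiment_texts, sentiment_labels)

# Training data set 2 from the question: 1 = billing.
# Assumption: the sentiment phrases are reused as non-billing (0) examples.
billing_texts = ["question on my bill", "billing fee",
                 "my bill is too high", "payment rejected"] + sentiment_texts
billing_labels = [1, 1, 1, 1] + [0] * len(sentiment_texts)
billing_model = NaiveBayes().fit(billing_texts, billing_labels)


def classify_email(text):
    # Returns (happy?, billing?) for one email body.
    return sentiment_model.predict(text), billing_model.predict(text)
```

Calling classify_email(email_text) runs both classifiers on the same text. With only a handful of training phrases the output is little more than keyword matching, which is why a bigger annotated set of full emails is recommended.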

Entity Recognition and Sentiment Analysis using NLP

So, this question might be a little naive, but I thought asking the friendly people of Stackoverflow wouldn't hurt.
My current company has been using a third-party API for NLP for a while now. We basically URL-encode a string and send it over, and they extract certain entities for us (we have a list of entities that we're looking for) and return a JSON mapping of entity : sentiment. We've recently decided to bring this project in house instead.
I've been studying NLTK, Stanford NLP and LingPipe for the past 2 days now, and can't figure out if I'm basically reinventing the wheel doing this project.
We already have massive tables containing the original unstructured text and another table containing the extracted entities from that text and their sentiment. The entities are single words. For example:
Unstructured text : Now for the bed. It wasn't the best.
Entity : Bed
Sentiment : Negative
I believe that implies we have training data (unstructured text) as well as entities and sentiments. Now how can I go about using this training data with one of the NLP frameworks and getting what we want? No clue. I've sort of got the steps, but I'm not sure:
Tokenize sentences
Tokenize words
Find the noun in the sentence (POS tagging)
Find the sentiment of that sentence.
But that should fail for the case I mentioned above since it talks about the bed in 2 different sentences?
So the question: does anyone know what the best framework would be for accomplishing the above tasks, and any tutorials on the same? (Note: I'm not asking for a solution.) If you've done this stuff before, is this task too large to take on? I've looked up some commercial APIs but they're absurdly expensive to use (we're a tiny startup).
Thanks stackoverflow!
OpenNLP may also be a library to look at. At least they have a small tutorial on training the name finder and on using the document categorizer to do sentiment analysis. To train the name finder you have to prepare training data by tagging the entities in your text with SGML tags.
http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind.training
NLTK provides a simple NER tagger along with resources, but it doesn't fit all cases (finding dates, for example). However, NLTK lets you modify and customize the NER tagger to suit your requirements. This link might give you some ideas, with basic examples of how to customize it. Also, if you are comfortable with Scala and functional programming, this is one tool you cannot afford to miss.
Cheers...!
I discovered spaCy recently and it's just great! In the link you can find a comparison of its speed and accuracy against NLTK and CoreNLP, and it does really well!
That said, solving your problem is not a matter of choosing a framework. You can have two different systems, one for NER and one for sentiment, and they can be completely independent. The hype these days is to use neural networks, and if you are willing to, you can train a recurrent neural network (which has shown strong performance on NLP tasks) with an attention mechanism to find both the entity and the sentiment.
There are great demos all over the internet; the last two I read and found interesting are [1] and [2].
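As a rough illustration of keeping the two parts independent, here is a small sketch where entity finding is a plain lookup against a known entity list and sentiment is a separate lexicon scorer. It is not a neural model and not any framework's API. The ENTITIES, LEXICON and NEGATORS tables are hypothetical toy values, and the `context` parameter (how many following sentences count toward an entity's sentiment) is an assumption added to cope with the "bed" example from the question, where the opinion sits in the next sentence.

```python
import re

# Hypothetical entity list and toy lexicon, for illustration only.
ENTITIES = {"bed", "room", "breakfast"}
LEXICON = {"best": 1.0, "great": 1.0, "clean": 0.5, "worst": -1.0, "dirty": -0.5}
NEGATORS = {"not", "never", "wasn't", "isn't"}


def split_sentences(text):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    return re.findall(r"[a-z']+", sentence.lower())


def sentence_score(tokens):
    """Sentiment system: sum lexicon polarities, flipping the first one after a negator."""
    score, negate = 0.0, False
    for tok in tokens:
        if tok in NEGATORS:
            negate = True
        elif tok in LEXICON:
            score += -LEXICON[tok] if negate else LEXICON[tok]
            negate = False
    return score


def entity_sentiment(text, context=1):
    """Entity system: look up known entities, then score their sentence plus `context` more."""
    sentences = [tokenize(s) for s in split_sentences(text)]
    results = {}
    for i, tokens in enumerate(sentences):
        for entity in ENTITIES.intersection(tokens):
            window = sentences[i:i + 1 + context]
            total = sum(sentence_score(t) for t in window)
            if total > 0:
                results[entity] = "Positive"
            elif total < 0:
                results[entity] = "Negative"
            else:
                results[entity] = "Neutral"
    return results
```

On "Now for the bed. It wasn't the best." this returns a negative label for "bed" because the following sentence is included in the window. The same window also lets one entity pick up sentiment meant for the next one, which is the kind of error a trained model with your existing labelled tables should handle better.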
Similar to spaCy, TextBlob is another fast and easy package that can accomplish many of these tasks.
I use NLTK, spaCy, and TextBlob frequently. If the corpus is simple, generic, and straightforward, spaCy and TextBlob work well out of the box. If the corpus is highly customized, domain-specific, or messy (incorrect spelling or grammar), I'll use NLTK and spend more time customizing my NLP text processing pipeline with scrubbing, lemmatizing, etc.
NLTK Tutorial: http://www.nltk.org/book/
Spacy Quickstart: https://spacy.io/usage/
Textblob Quickstart: http://textblob.readthedocs.io/en/dev/quickstart.html

Dataset for emotion classification on social media

I would like to do emotion classification on text (posts from social media, e.g. tweets, Facebook wall posts, YouTube comments, etc.). However, I can't find a good dataset with annotated data. I'm looking for more than just data annotated as positive or negative; I'm looking for a dataset with several emotions. These could be either discrete values (Ekman's six basic emotions) or continuous values (arousal-valence model). Does anyone know where I can get such a dataset? It can be from Twitter, Facebook, Myspace, ... as long as it is from a social network.
Well, I think a better (or at least more commonly used) name would be sentiment analysis (sentiment classification), correct? I'm not sure whether social media sites offer their private data (maybe some part of it). Anyway, I found this paper:
http://www.cs.cornell.edu/home/llee/papers/sentiment.pdf
They are dealing with data: http://www.cs.cornell.edu/people/pabo/movie-review-data/ from https://groups.google.com/forum/?fromgroups#!aboutgroup/rec.arts.movies.reviews.
Does it suit you? Basically, finding appropriate data is usually a big problem in ML. Often you need to build your own dataset (I mean labelling part of it manually and then applying clustering or semi-supervised learning to the rest).
If you don't find anything appropriate on the web, I'd try contacting authors who have written articles similar to your research. They may already have created datasets that suit you.

Short text classification

I am about to start a project where my final goal is to classify short texts into two classes: "may be interested in visiting place X" and "not interested or neutral". A place is described by a set of keywords (e.g. meals or types of meals, like "chinese food"). So ideally I need some approach to model a user's desire based on short-text analysis, and then classify based on a desire score or desire probability. Is there any state of the art in this field? Thank you
This problem is essentially the same as sentiment analysis of texts. But, instead of the traditional binary classification, you seem to have a "neutral" opinion as well. State-of-the-art in sentiment analysis is highly domain-dependent. Techniques that have excelled at classifying movie reviews do not perform as well on commercial products, for example.
Additionally, even the feature selection is highly domain-dependent. For example, unigrams work well for movie review classification, but a combination of unigrams and bigrams performs better for classifying Twitter texts.
My best advice is to "play around" with different features. Since you are looking at short texts, Twitter is probably a good motivational example. I would start with unigrams and bigrams as my features. The exact algorithm is not very important. An SVM usually performs very well with correct parameter tuning. Use a small amount of held-out data for tuning these parameters before experimenting on bigger datasets.
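To show what "unigrams and bigrams as features" means in practice, here is a small sketch in plain Python. The function names and the 20% held-out fraction are assumptions made for the example; in a real project a library vectorizer and an SVM implementation would take the place of the hand-rolled parts.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase word/number tokens; a stand-in for a proper tweet tokenizer.
    return re.findall(r"[a-z0-9']+", text.lower())


def ngram_features(text, max_n=2):
    """Count all n-grams up to max_n (unigrams and bigrams by default)."""
    tokens = tokenize(text)
    features = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            features[" ".join(tokens[i:i + n])] += 1
    return features


def train_heldout_split(items, heldout_fraction=0.2):
    """Keep the last part of the (already shuffled) data aside for parameter tuning."""
    cut = len(items) - int(len(items) * heldout_fraction)
    return items[:cut], items[cut:]
```

Each text becomes a sparse bag of unigram and bigram counts that can be fed to any classifier. Parameters such as the SVM regularisation strength would then be chosen on the held-out part only.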
The more interesting portion of this problem is the ranking! A "purity score" has been recently used for this purpose in the following papers (and I'd say they are pretty state-of-the-art):
Sentiment summarization: evaluating and learning user preferences. Lerman, Blair-Goldensohn and McDonald. EACL. 2009.
The viability of web-derived polarity lexicons. Velikovich, Blair-Goldensohn, Hannan and McDonald. NAACL. 2010.

RapidMiner and sentiment analysis

Has anyone out there used RapidMiner for sentiment analysis? Is this the right combination?
If not, how do I get started with a simple sentiment analysis application?
RapidMiner is a very powerful text mining and sentiment analysis tool. I can recommend the RapidMiner training courses offered by Rapid-I. They gave me a really quick start. They also offer a dedicated course on text mining and sentiment analysis:
Sentiment Analysis, Opinion Mining, and Automated Market Research.
Starting in September or October 2009, they will also offer webinars. You should contact them directly if you would like to learn more about their webinars. Several major online market research companies in Europe and the US are using RapidMiner for opinion mining and sentiment analysis of internet discussion groups and web blogs. For more details and references I would again suggest simply asking their team at contact(at)rapid-i.com or checking their RapidMiner forum at forum.rapid-i.com.
Best regards,
Frank
This series of videos should help:
http://vancouverdata.blogspot.com/2010/11/text-analytics-with-rapidminer-loading.html
When I go to the RapidMiner site, it confuses me.
http://rapidminer.com/solutions/sentiment-analysis/
"It looks like a crowd sourcing to identify the polarity of product reviews and discussions around the web." If you are looking to automate in real time this might not be a good solution.
spotdy.com offers free NLP for developers. It works pretty well.
Most sentiment analysis software tokenizes the text, gives each word a positive or negative factor, and sums those up. Since language is contextual, this ignores the context, which is not the right way to do it.
Instead, it uses deep learning models and HMMs based on sentence structure, computing the sentiment from how words are composed in a sentence. Check out spotdy.com. It is free.
