Dataset for emotion classification on social media - text

I would like to do emotion classification on text (posts from social media e.g. tweets, facebook wall posts, youtube comments etc ...). Though I can't find a good dataset with annotated data. I'm looking for more than just data annotated with positive and negative. I'm looking for a dataset with several emotions. This could be or discrete values (ekman 6 basic emotions) or continues values (arousal-valence model). Does anyone know where I can get such a dataset, this can be from twitter, Facebook, Myspace ... as long it is from a social network

well, I think better name (or, more often used) would be Sentiment analysis (Sentiment classification) - correct? I'm not sure if social media do offer their private data (maybe some part of it). Anyway, I found this paper:
http://www.cs.cornell.edu/home/llee/papers/sentiment.pdf
They are dealing with data: http://www.cs.cornell.edu/people/pabo/movie-review-data/ from https://groups.google.com/forum/?fromgroups#!aboutgroup/rec.arts.movies.reviews.
Does it suit you? Basically, finding appropriate data is usually a big problem in ML. Often it is needed to build your own (I mean to classify a part of it manually and apply some clustering or semi-supervised learning afterwards)
If you don't find anything appropriate on the web, I'd try to contact some authors that write articles similar to your research. Maybe they will have already created datasets that will fit you...

Related

speech to text training for impaired voice

I want to train and use an ML based personal voice to text converter for a highly impaired voice, for a small set of 300-400 words. This is to be used for people with voice impairment. But cannot be generic because each person will have a unique voice input for words, depending on their type of impairment.
Wanted to know if there are any ML engines which allow for such a training. If not, what is the best approach to go about it.
Thanks
Most of the speech recognition engines support training (wav2letter, deepspeech, espnet, kaldi, etc), you just need to feed in the data. The only issue is that you need a lot of data to train reliably (1000 of samples for each word). You can check Google Commands dataset for example of how to train from scratch.
Since the training dataset will be pretty small for your case and will consist of just a few samples, you can probably start with existing pretrained model and finetune it on your samples to get best accuracy. You need to look on "few short learning" setups.
You can probably look on wav2vec 2.0 pretrained model, it should be effective for such learning. You can find examples and commands for fine-tuning and inference here.
You can also try fine-tuning Japser models in Google Commands for NVIDIA NEMO. It might be a little less effective but could still work and should be easier to setup.
I highely recommend watching the youtube original series "The age of AI"'s First season, episode two.
Basically, google already done this for people who can't really form normal words with impared voice. It is very interesting and speaks a little bit about how they done and doing that with ML technologies.
enter link description here

What is an appropriate training set size for text classification (Sentiment analysis)

I just wanted to understand (from your experience), that if I have to create a sentiment analysis classification model (using NLTK), what would be a good training data size. For instance if my training data is going to contain tweets, and I intend to classify them as positive,negative and neutral, how many tweets each should I ideally have per category to get a reasonable model working?
I understand that there are many parameters like quality of data, but if one has to get started what might be a good number.
That's a really hard question to answer for people who are not familiar with the exact data, its labelling and the application you want to use it for. But as a ballpark estimate, I would say start with 1,000 examples of each and go from there.

News Article Categorization (Subject / Entity Analysis via NLP?); Preferably in Node.js

Objective: a node.js function that can be passed a news article (title, text, tags, etc.) and will return a category for that article ("Technology", "Fashion", "Food", etc.)
I'm not picky about exactly what categories are returned, as long as the list of possible results is finite and reasonable (10-50).
There are Web APIs that do this (eg, alchemy), but I'd prefer not to incur the extra cost (both in terms of external HTTP requests and also $$) if possible.
I've had a look at the node module "natural". I'm a bit new to NLP, but it seems like maybe I could achieve this by training a BayesClassifier on a reasonable word list. Does this seem like a good/logical approach? Can you think of anything better?
I don't know if you are still looking for an answer, but let me put my two cents for anyone who happens to come back to this question.
Having worked in NLP i would suggest you look into the following approach to solve the problem.
Don't look for a single package solution. There are great packages out there, no doubt for lots of things. But when it comes to active research areas like NLP, ML and optimization, the tools tend to be atleast 3 or 4 iterations behind whats there is academia.
Coming to the core problem. What you want to achieve is text classification.
The simplest way to achieve this would be an SVM multiclass classifier.
Simplest yes, but also with very very (see the double stress) reasonable classification accuracy, runtime performance and ease of use.
The thing which you would need to work on would be the feature set used to represent your news article/text/tag. You could use a bag of words model. add named entities as additional features. You can use article location/time as features. (though for a simple category classification this might not give you much improvement).
The bottom line is. SVM works great. they have multiple implementations. and during runtime you don't really need much ML machinery.
Feature engineering on the other hand is very task specific. But given some basic set of features and a good labelled data you can train a very decent classifier.
here are some resources for you.
http://svmlight.joachims.org/
SVM multiclass is what you would be interested in.
And here is a tutorial by SVM zen himself!
http://www.cs.cornell.edu/People/tj/publications/joachims_98a.pdf
I don't know about the stability of this but from the code its a binary classifier SVM. which means if you have a known set of tags of size N you want to classify the text into, you will have to train N binary SVM classifiers. One each for the N category tags.
Hope this helps.

Sources of classified sentiment data?

I'm looking to train a naive Bayes with some new data sources that haven't been used before. I've already looked at the Lee & Pang corpus of IMDB reviews and the MPQA opinion corpus. I'm looking for new web services that fit the following criteria.
Easily Classified - must have a like/dislike or 5 star rating
Readily available
Pertain to new material (less important than the first two)
Here are some samples I have come up with on my own.
Etsy API
Rotten Tomatoes API
Yelp API
Any other suggestions would be much appreciated =)
In Pang&Lee's later work (2008) "Opinion Mining and Sentiment Analysis" here they have a section for publicly available resources. It has links to those corpora.
Take a look at sentiment140. It has a corpus that you can download and train with. You can easily extend to new tweets.

Need training data for categories like Sports, Entertainment, Health etc and all the sub categories

I am experimenting with Classification algorithms in ML and am looking for some corpus to train my model to distinguish among the different categories like sports,weather, technology, football,cricket etc,
I need some pointers on where i can find some dataset with these categories,
Another option for me, is to crawl wikipedia to get data for the 30+ categories, but i wanted some brainstorming and opinions, if there is a better way to do this.
Edit
Train the model using the bag of words approach for these categories
Test - classify new/unknown websites to these predefined categories depending on the content of the webpage.
The UCI machine learning repository contains a searchable archive of datasets for supervised learning.
You might get better answers if you provide more specific information about what inputs and outputs your ideal dataset would have.
Edit:
It looks like dmoz has a dump that you can download.
A dataset of newsgroup messages, classified by subject

Resources