I am doing sentiment analysis of Twitter text and want to do it using Maximum Entropy and SVM classifiers. I looked up the Stanford Classifier but cannot find its implementation in Java. Can anyone guide me on where to start?
Related
I was performing sentiment analysis on the IMDb dataset on Kaggle. I used the bag-of-words (BOW) approach with bigrams, which gave me a decent accuracy of ~89%. But I don't know how to approach the same task using word embeddings: should I go for averaged word vectors or doc2vec?
Someone please help. Thanks in advance.
Here's a recent blog post comparing word2vec averaging vs doc2vec performance. The post favors doc2vec. The result also depends on which classification model you are using (logistic regression, SVM, LSTM, etc.).
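For reference, the "averaged word vectors" option from the question boils down to the sketch below. It assumes a toy embedding table whose values are invented purely for illustration; in practice the vectors would come from a trained word2vec model, and the function name `average_vector` is hypothetical.

```python
from typing import Dict, List

# Hypothetical 3-dimensional embeddings, invented for illustration only.
TOY_VECTORS: Dict[str, List[float]] = {
    "great": [0.9, 0.1, 0.0],
    "movie": [0.1, 0.8, 0.1],
    "boring": [-0.8, 0.1, 0.1],
}


def average_vector(tokens: List[str], vectors: Dict[str, List[float]], dim: int) -> List[float]:
    """Average the vectors of the known tokens; return zeros if none is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


# "a" has no vector here, so only "great" and "movie" contribute.
doc_vec = average_vector("a great movie".split(), TOY_VECTORS, dim=3)
```

The resulting fixed-length vector can be fed to any of the classifiers mentioned above. Averaging discards word order, which is one reason the two approaches can perform differently.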
I have built a text classifier using OneClassSVM.
My training set contains only one label ("Yes"); I have no data for the other label ("No"). My task is to build a classifier that labels a new, unseen sentence (test data) as 1 if it is very similar to the training data, and as -1 (an anomaly) otherwise.
I have used Word2Vec to build the word embeddings for my training data. I then use word-vector averaging with OneClassSVM to build an anomaly detector.
This classifier currently gives an accuracy of about 50-55%. I have to improve it further to build a robust classifier.
Any suggestions for this problem would be helpful.
I'd suggest a very different approach since you have no training examples for the negative class at all.
You could train a language model on your training data. At inference time, you score the input with the language model, and classify it according to some threshold on the perplexity of the input sentence according to the LM.
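A minimal sketch of this idea, assuming a bigram model with add-one smoothing as the language model (any stronger LM could be swapped in). The function names, the toy training sentences, and the threshold of 7.0 are all illustrative assumptions; the threshold in particular would need tuning on held-out data.

```python
import math
from collections import Counter


def train_bigram_lm(sentences):
    """Count unigrams and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def perplexity(sentence, unigrams, bigrams):
    """Per-token perplexity under an add-one (Laplace) smoothed bigram model."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    vocab_size = len(unigrams) + 1  # +1 reserves probability mass for unseen words
    log_prob = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        prob = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        log_prob += math.log(prob)
    return math.exp(-log_prob / (len(tokens) - 1))


def classify(sentence, unigrams, bigrams, threshold):
    """Return 1 if the sentence looks like the training data, else -1 (anomaly)."""
    return 1 if perplexity(sentence, unigrams, bigrams) <= threshold else -1


# Toy single-class training data, invented for illustration.
train = [
    "the service was great",
    "the food was great",
    "great service and great food",
]
uni, bi = train_bigram_lm(train)
in_domain = classify("the food was great", uni, bi, threshold=7.0)
out_of_domain = classify("quantum chromodynamics lecture notes", uni, bi, threshold=7.0)
```

Sentences that resemble the training data get a low perplexity and are labelled 1; sentences full of unseen words get a high perplexity and are labelled -1.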
I have tried different approaches such as MultinomialNB, SVM, MLPClassifier, CNN, and an LSTM network to train on a dataset that consists of tweets and labels (the Big Five classes: openness, conscientiousness, extraversion, agreeableness, neuroticism). But the accuracy stays at around 60% even after using word2vec, NRC features, and MRC features. Is there something I can do to improve the accuracy?
Would you please add a few more details about the dataset you are using?
For example I would add:
Dataset size (number of samples)
Class distribution (are the classes balanced or not)
Any preprocessing you do
Without the above information I can only guess, but if I were you I would try the following:
Clean the tweets of noise, e.g. usernames, garbage symbols, etc.
If the dataset is small:
  try random search over models (Naive Bayes, SVM, logistic regression) using various vectorization strategies, e.g. bag of words or tf-idf, and do a hyperparameter search
  try applying transfer learning from a model trained on tweets, for example for sentiment analysis
If the dataset is large enough:
  try a neural network approach
  Embedding (GloVe, word2vec, fastText) + RNN (LSTM, GRU) + Attention
  try training your own embedding
  use pretrained ones
  Embedding + CNN + RNN
  Bag of words + FNN
If the classes are not balanced:
  use a weighted loss
  try to balance them
Try stacking multiple models (ensemble).
Hope it helps!
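Two of the cheaper suggestions above, cleaning the tweets and weighting the loss by class, can be sketched as follows. The regular expressions and helper names are illustrative assumptions, not a standard recipe. The weight formula n_samples / (n_classes * class_count) is the same heuristic scikit-learn uses for `class_weight="balanced"`.

```python
import re
from collections import Counter


def clean_tweet(text):
    """Strip URLs, @usernames and stray symbols; keep the words of hashtags."""
    text = re.sub(r"https?://\S+", " ", text)    # URLs
    text = re.sub(r"@\w+", " ", text)            # user mentions
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)  # leftover symbols, including '#'
    return " ".join(text.lower().split())


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}


cleaned = clean_tweet("@bob Loving this!!! http://t.co/abc #happy")
# Toy label list, invented for illustration: three "openness", one "neuroticism".
weights = balanced_class_weights(["openness"] * 3 + ["neuroticism"])
```

Rare classes receive larger weights, so misclassifying them costs more during training. Note that this cleaning step also removes emoticons and punctuation, which may carry signal for personality or sentiment tasks.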
Is the main premise of your project to do personality detection? If not, I would recommend using the Google Sentiment API to calculate sentiment of Twitter data.
Hi, I am new to sentiment analysis and am currently using the Stanford CoreNLP API. I am able to get sentiments for sentences: positive, neutral, and negative.
Are there any examples I could follow for using the different classifier algorithms provided by the API, such as Naive Bayes and SVM, to get different sentiment scores for comparison? Thank you.
There are currently no other algorithms supported for sentiment analysis. You can, however, train your own without too much difficulty: bigram features with a simple classifier work quite well for sentiment tasks.
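As a rough illustration of "bigram features with a simple classifier", here is a hand-rolled multinomial Naive Bayes over unigram and bigram counts. It is a sketch on invented toy data, written in plain Python rather than against the CoreNLP API, and every function name in it is hypothetical.

```python
import math
from collections import Counter, defaultdict


def features(text):
    """Unigram and bigram features from a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def train_nb(examples):
    """Collect per-class document and feature counts from (text, label) pairs."""
    class_counts = Counter()
    feat_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        class_counts[label] += 1
        for f in features(text):
            feat_counts[label][f] += 1
            vocab.add(f)
    return class_counts, feat_counts, vocab


def predict(text, model):
    """Return the label with the highest add-one smoothed log-probability."""
    class_counts, feat_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total_docs)
        denom = sum(feat_counts[label].values()) + len(vocab)
        for f in features(text):
            if f in vocab:  # ignore features never seen in training
                score += math.log((feat_counts[label][f] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data, invented for illustration.
model = train_nb([
    ("I loved this movie", "positive"),
    ("what a great film", "positive"),
    ("not good at all", "negative"),
    ("I hated this movie", "negative"),
])
label_a = predict("not good", model)
label_b = predict("loved this movie", model)
```

The bigram "not good" is what lets the model treat negation differently from the unigram "good" alone, which is the usual argument for adding bigrams in sentiment tasks.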
I have been trying to use the NER feature of NLTK. I want to extract such entities from articles. I know it cannot be perfect at this, but I wonder: if there is human intervention in between to manually tag NEs, will it improve?
If yes, is it possible to continually train the present model in NLTK (semi-supervised training)?
The plain vanilla NER chunker provided in NLTK internally uses a maximum entropy chunker trained on the ACE corpus. Hence it cannot identify dates or times unless you train it with your own classifier and data (which is quite a meticulous job).
You could refer to this link for doing the same.
Also, there is a module called timex in nltk_contrib which might help you with your needs.
If you want to do the same in Java, look into Stanford SUTime, which is part of Stanford CoreNLP.