I have implemented sentiment analysis using the sentiment analysis module of LingPipe. I know that they use a Dynamic LR model for this. It only tells me whether the test string expresses a positive or a negative sentiment. What ideas could I use to determine the object for which the sentiment has been expressed?
If the text is categorized as positive sentiment, I would like to get the object for which the sentiment has been expressed - this could be a movie name, a product name, or something else.
Although this question is really old, I would like to answer it for others' benefit.
What you want here is concept-level sentiment analysis. For a very basic version, I would recommend the following steps:
Apply a sentence splitter. You can use either LingPipe's Sentence Splitter or the OpenNLP Sentence Detector.
Apply part-of-speech tagging. Again, you can use either LingPipe's POS tagger or the OpenNLP POS Tagger.
You then need to identify the token(s) tagged as nouns by the POS tagger. These tokens are the candidates for the targeted entity in the sentence.
Then you need to find the sentiment words in the sentence. The easiest way to do this is with a dictionary of sentiment-bearing words; you can find many such dictionaries online.
The next step is to find the dependency relations in the sentences. This can be achieved by using the Stanford Dependency Parser. For example, if you try the sentence "This phone is good." in their online demo, you can see the following 'Typed Dependencies':
det(phone-2, This-1),
nsubj(good-4, phone-2),
cop(good-4, is-3),
root(ROOT-0, good-4)
The dependency nsubj(good-4, phone-2) here indicates that phone is the nominal subject of the token good, implying that the word good is expressed about phone. Your sentiment dictionary will surely contain the word good, and phone will have been identified as a noun by the POS tagger. Thus, you can conclude that the sentiment good was expressed for the entity phone.
This was a very basic example. You can go a step further and create rules around the dependency relations to extract more complex sentiment-entity pairs. You can also assign scores to your sentiment terms and compute a total score for the sentence based on the number of occurrences of sentiment words in it.
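As a rough illustration of steps 2-5, here is a minimal sketch using Stanza, the Stanford NLP Group's Python library, standing in for the Stanford Dependency Parser mentioned above, with a tiny stand-in sentiment dictionary (a real one would be far larger):

```python
import stanza

# stanza.download("en")  # one-time model download
nlp = stanza.Pipeline("en", processors="tokenize,mwt,pos,lemma,depparse")

# Tiny stand-in for a real dictionary of sentiment-bearing words.
SENTIMENT_WORDS = {"good", "great", "bad", "terrible"}

doc = nlp("This phone is good.")
for sent in doc.sentences:
    for word in sent.words:
        # Look for nsubj(sentiment-word, noun), e.g. nsubj(good-4, phone-2).
        if word.deprel == "nsubj" and word.upos in ("NOUN", "PROPN"):
            head = sent.words[word.head - 1]  # head indices are 1-based
            if head.lemma.lower() in SENTIMENT_WORDS:
                print(f"sentiment '{head.text}' expressed for entity '{word.text}'")
```

For "This phone is good." this prints that good was expressed for the entity phone, matching the manual analysis above.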
Usually a sentiment-bearing sentence is about its main entity, so a basic heuristic is to run NER and take the first entity found. Otherwise, you should use a deep-parsing NLP toolkit and write some rules to link the sentiment to its object.
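As an illustration of that heuristic, here is a minimal sketch using spaCy's NER (any NER tool would do; which entities get detected depends on the model, and the example sentence is made up):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The Galaxy S10 has an amazing screen.")

# Take the first named entity, if any, as the sentiment target.
target = doc.ents[0].text if doc.ents else None
print(target)
```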
Consider the following sentence:
"Don't pay attention to people if they say it's no good."
As humans, we understand that the overall sentiment of the sentence is positive.
Technique 1: "bag of words" (BoW)
We have two categories of words: "positive" words with a polarity of 1 and "negative" words with a polarity of 0. In this case the word "good" falls into the positive category, but here it is only accidentally correct. Thus, this technique is ruled out.
Technique 2: still BoW, but with context (a step towards word embeddings)
Take the surrounding words into consideration; in this case, the preceding "no" makes the phrase "no good" rather than the bare adjective "good". However, "no good" is still not what the author intended, given the context of the entire sentence.
Hence this question. Thanks in advance.
Word embeddings are one possible way to take into account the complexity that comes from the sequence of terms in your example. Using models pre-trained on general English, such as BERT, should give you interesting results for your sentiment analysis problem. You can leverage the implementations provided by the Hugging Face library.
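If you want to try this quickly, here is a minimal sketch using the Hugging Face transformers pipeline; the default model is just a convenient stand-in, and any fine-tuned sentiment model can be substituted:

```python
from transformers import pipeline

# Loads a default English sentiment model (a DistilBERT fine-tuned on SST-2).
classifier = pipeline("sentiment-analysis")

result = classifier("Don't pay attention to people if they say it's no good.")
print(result)  # a list with one {'label': ..., 'score': ...} dict
```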
Another approach, which doesn't rely on compute-intensive techniques (such as word embeddings), is to use n-grams, which capture the sequence aspect and should provide good features for sentiment estimation. You can try different depths (unigrams, bigrams, trigrams...) and combine them with different types of preprocessing and/or tokenizers. Scikit-learn provides a good reference implementation of n-grams in its CountVectorizer class.
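A minimal sketch of that n-gram approach with scikit-learn, using a tiny made-up training set purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labelled examples; a real corpus would have thousands of reviews.
texts = ["it's no good", "this is really good", "not bad at all", "truly bad service"]
labels = [0, 1, 1, 0]  # 1 = positive, 0 = negative

model = make_pipeline(
    CountVectorizer(ngram_range=(1, 3), lowercase=True),  # unigrams to trigrams
    LogisticRegression(),
)
model.fit(texts, labels)
print(model.predict(["it's no good at all"]))
```

With trigrams, the phrase "no good" becomes a feature of its own, which is exactly the sequence information a plain unigram BoW misses.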
A few papers on the topics of word and document embeddings (word2vec, doc2vec) mention that they used the Stanford CoreNLP framework to tokenize/lemmatize/POS-tag the input words/sentences:
The corpora were lemmatized and POS-tagged with the Stanford CoreNLP (Manning et al., 2014) and each token was replaced with its lemma and POS tag
(http://www.ep.liu.se/ecp/131/039/ecp17131039.pdf)
For pre-processing, we tokenise and lowercase the words using Stanford CoreNLP
(https://arxiv.org/pdf/1607.05368.pdf)
So my questions are:
Why does the first paper apply POS-tagging? Would each token then be replaced with something like {lemma}_{POS} and the whole thing used to train the model? Or are the tags used to filter tokens?
For example, gensim's WikiCorpus applies lemmatization by default and then keeps only a few parts of speech (verbs, nouns, etc.), discarding the rest. So what is the recommended way?
The quote from the second paper sounds to me like they only split up the words and then lowercase them. This is also what I tried first, before I used WikiCorpus. In my opinion, this should give better results for document embeddings, as most POS types contribute to the meaning of a sentence. Am I right?
In the original doc2vec paper I did not find details about their pre-processing.
For your first question, the answer is "it depends on what you are trying to accomplish!"
There isn't a recommended way per se to pre-process text. To clean a text corpus, the first steps are usually tokenization and lemmatization. Next, to remove unimportant terms/tokens, you can remove stop words, or even apply POS tags and remove tokens based on their grammatical category, on the assumption that some grammatical categories (such as adjectives) do not contain valuable information for, say, topic modelling. But this depends entirely on the type of analysis you are going to perform after the pre-processing step.
For the second part of your question: as explained above, tokenisation and lowercasing are standard parts of the pre-processing routine. So I also suspect that, regardless of the ML algorithm used later on, your results will be better if you carefully pre-process your data. I am not sure whether POS tags contribute to the meaning of a sentence, though.
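To make the two preprocessing variants discussed above concrete, here is a minimal sketch using spaCy as a stand-in for Stanford CoreNLP:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # spaCy standing in for Stanford CoreNLP

def lemma_pos_tokens(text):
    """First paper's scheme: replace each token with its lemma plus POS tag."""
    return [f"{t.lemma_.lower()}_{t.pos_}" for t in nlp(text) if not t.is_punct]

def plain_tokens(text):
    """Second paper's scheme: tokenize and lowercase only."""
    return [t.text.lower() for t in nlp(text)]

print(lemma_pos_tokens("The dogs were barking loudly."))
# e.g. ['the_DET', 'dog_NOUN', 'be_AUX', 'bark_VERB', 'loudly_ADV']
print(plain_tokens("The dogs were barking loudly."))
# ['the', 'dogs', 'were', 'barking', 'loudly']
```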
Hope I provided some valuable feedback for your research. If not, you could provide a code sample to discuss this issue further.
I'm trying to implement an aspect miner based on consumer reviews on Amazon for durables (washing machines, refrigerators). The idea is to output sentiment polarity for aspects instead of for the entire sentence. For example, the review 'Food was good but service was bad' must output food as positive and service as negative. I read Richard Socher's paper on the RNTN model for fine-grained sentiment classification, but I guess I'll need to manually tag sentiment for phrases in a different domain and create my own treebank for better accuracy.
Here's an alternate approach I'd thought of. Could someone please validate it or guide me with feedback?
Break the approach into two subtasks: 1) identify aspects, 2) identify sentiment.
Identify aspects
Use a POS tagger to identify all nouns. This should shortlist potentially all aspects in the reviews.
Use word2vec on these nouns to find similar nouns and reduce the dataset size (see the sketch after these steps).
Identify sentiments
Train a CNN or dense-net model on reviews with ratings 1, 2, 4 and 5 (ignore 3, as we need data that has polarity).
Break down the test-set reviews into phrases (e.g. 'Food was good') and then score them using the above model.
Find the aspects identified in the first subtask and tag them to their respective phrases.
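For the word2vec step, here is a minimal sketch with gensim; the vector file path and the seed/candidate words are purely illustrative (and words missing from the vocabulary would raise a KeyError in practice):

```python
from gensim.models import KeyedVectors

# Hypothetical path; any pre-trained vectors in word2vec format will do.
wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin",
                                       binary=True)

aspect_seeds = ["food", "service"]              # seed aspects for the domain
candidate_nouns = ["meal", "waiter", "pizza"]   # nouns shortlisted by the POS tagger

for noun in candidate_nouns:
    # Map each candidate noun to its most similar seed aspect.
    best = max(aspect_seeds, key=lambda seed: wv.similarity(noun, seed))
    print(f"{noun} -> {best} ({wv.similarity(noun, best):.2f})")
```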
I don't know how to answer this question fully, but I have a few suggestions:
Take a look at multitask learning in the neural-network literature and try an end-to-end neural net for multiple tasks.
Use pretrained word vectors such as word2vec or GloVe as inputs.
Don't rely on POS taggers when you use internet data.
Find a way to represent named entities and out-of-vocabulary words in your design.
Don't ignore rating 3!
You should annotate some data periodically.
Let's say I have a sentence:
"you hello how are ?"
I get the output:
you_PRP hello_VBP how_WRB are_VBP
What is the best way to arrange the words into a proper English sentence, like: "Hello how are you?"
I am new to natural language processing, so I am unfamiliar with many terms.
The only way I can think of off the top of my head is to use rules to determine the adverb-verb-noun order and then rearrange the words based on that.
Note: let's assume I am trying to form a proper question, so ignore determining whether it's a question or a statement.
You should look into language models. A bigram language model, for example, will give you the probability of observing a sentence on the basis of the two-word sequences in that sentence. On the basis of a corpus of texts, it will have learned that "how are" has a higher probability of occurring than "are how". If you multiply the probabilities of all these two-word sequences in a sentence, you will get the probability of the sentence.
In other words, this is how you can solve your problem:
Find a corpus (either a simple text corpus, or a corpus that has been tagged with part-of-speech tags).
Learn a language model from that corpus. You can do this simply on the basis of the words, or on the basis of the words and their part-of-speech tags, as in your example.
Generate all possible sequences of your target words.
Use the language model to compute the probabilities of all those sequences.
Pick the sequence with the highest probability.
If you work with Python, nltk has an api for training and using language models. Otherwise, KenLM is a popular language modelling package.
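Here is a minimal sketch of steps 2-5 using nltk's language-model API, with a toy two-sentence corpus standing in for a real one (Laplace smoothing keeps unseen bigrams from zeroing out every permutation):

```python
from itertools import permutations

from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline

# Toy corpus; in practice, train on a large collection of texts.
corpus = [["hello", "how", "are", "you"],
          ["how", "are", "you", "today"]]
train, vocab = padded_everygram_pipeline(2, corpus)
lm = Laplace(2)  # bigram model with add-one smoothing
lm.fit(train, vocab)

def sentence_score(seq):
    """Product of bigram probabilities, including sentence padding."""
    padded = ["<s>"] + list(seq) + ["</s>"]
    score = 1.0
    for prev, word in zip(padded, padded[1:]):
        score *= lm.score(word, [prev])
    return score

words = ["you", "hello", "how", "are"]
best = max(permutations(words), key=sentence_score)
print(" ".join(best))  # "hello how are you" on this toy corpus
```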
For those who are not aware of the Stanford sentiment analyzer: their model predicts the sentiment of a sentence using the sentiment scores of various phrases embedded in the sentence. For example:
This movie doesn't care about cleverness, wit or any other kind of intelligent humor.
It has a negative sentiment, which the model captures using the sentiment of various words like:
doesn't, care, wit, etc.
It's a bottom-up approach where each word has an assigned sentiment, then each collection of words has a sentiment, and so on.
What I need is the reverse process: the sentiment has already been assigned to the sentence, and I want to correctly assign sentiments to the bottom nodes that lead to it. I am unable to think of a method to do so.