Google Natural Language Sentiment Analysis Aggregate Scores - nlp

The documentation for the Google Cloud Platform Natural Language API states that
"The overall score and magnitude values for an entity are an aggregate of the specific score and magnitude values for each mention of the entity."
I can't figure out how this aggregation works. In the example provided in the documentation, Marvin Gaye has two mentions. One mention has a score of 0.4 and a magnitude of 0.4; the other has a score of -0.2 and a magnitude of 0.2. The aggregate sentiment for Marvin Gaye is a score of 0.1 and a magnitude of 0.6.
I have tried other texts myself and can't figure out how the aggregation is made. Does anyone know?

I think it depends on the length of the document and on how you use certain key words. I ran some tests and the results were all different, except for a couple of cases where I used the name of a famous person without any expression of emotion and always got 0.
I can say that it is not a simple sum of the values; it could be some less obvious operation on the values shown in the response.
About the Marvin Gaye example, the result is a mixed sentiment because of the emotional expressions used: "is the best" and "so sad".
Hope this helps with your research.

I contacted Google Cloud Platform Support and got this answer:
"The way the aggregation works is breaking down the input text into smaller components, often ngrams, which is likely why the documentation talks about aggregation, however, the aggregation isn't a simple addition, one can't sum individual sentiment values of each entity to get a total score."
So it doesn't seem possible to give a simple explanation of exactly how the aggregation is made.
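That said, a naive aggregation, taking the arithmetic mean of the mention scores and the sum of the mention magnitudes, happens to reproduce the Marvin Gaye numbers from the documentation, even though, per the support answer above, it is not the rule the API actually follows in general. A minimal sketch:
# Naive aggregation: mean of mention scores, sum of mention magnitudes.
# The (score, magnitude) pairs below are the ones from the documentation's Marvin Gaye example.
mentions = [(0.4, 0.4), (-0.2, 0.2)]
agg_score = sum(score for score, _ in mentions) / len(mentions)
agg_magnitude = sum(magnitude for _, magnitude in mentions)
print(round(agg_score, 1), round(agg_magnitude, 1))  # 0.1 0.6, matching the documented aggregate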

Related

Document Sentiment Magnitude != sum(Sentence Magnitude)

I am currently utilizing the google cloud NL api for some tests where I analyze news articles. I was initially curious about how document magnitude was calculated, and searches here yielded
Google Cloud Natural Language API - How is document magnitude calculated?
where it was mentioned to be the sum of constituent sentence magnitudes.
In my own tests, I have found that this is not the case. Is there anything I might be doing wrong?
For clarity, I am running Python 3.7.3 in a conda environment with google-cloud-language obtained from conda-forge.
from google.cloud import language            # 1.x client, as used in this question
from google.cloud.language import enums, types

client = language.LanguageServiceClient()

# 'text' holds the news article being analyzed
document = types.Document(content=text, type=enums.Document.Type.PLAIN_TEXT)
sentiment = client.analyze_sentiment(document=document)

# sum the per-sentence magnitudes and compare with the document-level magnitude
test_mag = 0
for sent_obj in sentiment.sentences:
    test_mag += sent_obj.sentiment.magnitude

print(sentiment.document_sentiment.magnitude)
print(test_mag)
From another thread, it can sometimes be just the absolute sum of the sentence magnitudes, but not always:
Google Natural Language Sentiment Analysis Aggregate Scores
"The way the aggregation works is breaking down the input text into smaller components, often ngrams, which is likely why the documentation talks about aggregation, however, the aggregation isn't a simple addition, one can't sum individual sentiment values of each entity to get a total score."
I assume this is the case for score and magnitude calculations.

Information Retrieval: How to combine different word results when using tf-idf?

Let's say I have a user search query which looks like:
"the happy bunny"
I have already computed tf-idf and have something like this (the following are made-up example values) for each document in which I am searching (of course the idf is always the same):
word     tf       idf    score
the      0.06      1     0.06 * 1    = 0.06
happy    0.002    20     0.002 * 20  = 0.04
bunny    0.0005   60     0.0005 * 60 = 0.03
I have two questions with what to do next.
Firstly, "the" still has the highest score, even though it is adjusted for rarity by idf, and it's still not exactly important. Do you think I should square the idf values to weight more heavily towards rare words, or would this give bad results? Otherwise I'm worried that "the" is getting equal importance to "happy" and "bunny", when it should be obvious that "bunny" is the most important word in the search. If rare always meant important, then weighting by rarity would always be a good idea, but if that is not always the case then doing so could really mess up the results.
Secondly, and more importantly: what is the best/preferred method for combining the scores for each word to give each document a single score that represents how well it reflects the entire search query? I was thinking of adding them, but it has become apparent that this would give higher priority to a document containing 10,000 occurrences of "happy" but only 1 of "bunny" over another document with 500 "happy" and 500 "bunny" (which would be the better match).
First, make sure that you are computing the correct TF-IDF values. As others have pointed out, they do not look right. TF is relative to a specific document, and we often do not need to compute it for queries (since the raw term frequency is almost always 1 in queries). There are different types of TF functions to pick from (check the Wikipedia page on tf-idf, it has good coverage). Log normalisation is common and the most efficient scheme, since it saves an extra disk access to get the respective document's maximum term frequency (maxF) that is needed for something like double normalisation. When you are dealing with large volumes of documents this can be expensive, especially if you can't bring them into memory. A bit of insight into inverted files can go a long way towards understanding some of the underlying complexities. Log normalisation is efficient and is a non-linear function, and therefore better than raw frequency.
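For concreteness, a minimal sketch of the log-normalisation scheme mentioned above (one common variant is 1 + log(raw frequency), with 0 for absent terms; other variants exist):
import math

def log_tf(raw_freq):
    # Log-normalised term frequency: dampens the effect of very frequent terms.
    return 1 + math.log(raw_freq) if raw_freq > 0 else 0.0

print(log_tf(1), log_tf(500), log_tf(10000))  # 1.0, ~7.2, ~10.2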
Once you are certain on your weighting scheme, then you may want to consider a stop list to get rid of very common/noisy words. These do not contribute to the rank of documents. It is generally recommended to use a stop list of high frequency, very common words. Do a search and you will find many available, including the one that Lucene uses.
The rest lies in your ranking strategy, and that will depend on your implementation/model. The vector space model (VSM) is simple and readily available in libraries like Lucene, Lemur, etc. VSM computes the dot product (scalar product) of the weights of the terms shared by the query and a document. Term weights are normalised via vector length normalisation (which answers your second question), and the result of applying the model is a value between 0 and 1. This is also interpreted as the cosine of the angle between the two vectors, i.e. the dot product divided by the product of the Euclidean lengths of the two vectors.
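A minimal sketch of that vector space model idea, assuming the query and each document have already been reduced to {term: tf-idf weight} dictionaries (the weights below are made up for illustration):
import math

def cosine_score(query_weights, doc_weights):
    # Dot product over the terms the query and document share,
    # normalised by the Euclidean length of both vectors.
    dot = sum(w * doc_weights.get(t, 0.0) for t, w in query_weights.items())
    q_len = math.sqrt(sum(w * w for w in query_weights.values()))
    d_len = math.sqrt(sum(w * w for w in doc_weights.values()))
    return dot / (q_len * d_len) if q_len and d_len else 0.0

query = {"happy": 0.04, "bunny": 0.03}
doc_a = {"happy": 9.5, "bunny": 0.9}   # lots of "happy", hardly any "bunny"
doc_b = {"happy": 6.7, "bunny": 6.7}   # balanced, the better match
print(cosine_score(query, doc_a), cosine_score(query, doc_b))
Because of the length normalisation, the balanced document scores higher even though the other document carries a larger total "happy" weight, which is exactly the behaviour asked for in the second question.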
One of the earliest comprehensive studies on weighting schemes and ranking with VSM is an article by Salton (pdf) and is a good read if you are interested in Information Retrieval. A bit outdated perhaps (notice how log normalisation is not mentioned in the article).
Your best read I believe is the book Introduction to Information Retrieval by Christopher Manning. It will take you through everything that you need to know, from indexing to ranking schemes, etc. A bit lacking on ranking models (does not cover some of the more complex probabilistic approaches).
You should reconsider your TF and IDF values; they do not look correct. The TF value is usually just how often the word occurs, so if the word "the" appeared 20 times its tf value would be 20. A word like "the" should have a very low IDF value (close to zero, e.g. 0.000x).
You could use stop-word removal if words like "the" are not necessary; they would be removed rather than just given a low score.
A vector space model could be used for this.
Can you compute tf-idf for amalgamated terms? That is, you first generate a sentiment that considers each of its components as equal, before treating the sentiment as a single term for which you then compute the tf-idf.

sentiment analysis - wordNet , sentiWordNet lexicon

I need a list of positive and negative words with weights assigned according to how strong or weak they are. I have got:
1.) WordNet - it gives a + or - score for every word.
2.) SentiWordNet - gives positive and negative values in the range [0, 1].
I checked these on a few words:
love - WordNet gives 0.0 for both noun and verb; I don't know why, I think it should be at least somewhat positive.
repress - WordNet gives -9.93; SentiWordNet gives 0.0 for both pos and neg (should be negative).
repose - WordNet gives 2.488; SentiWordNet gives { pos: 0.125, neg: 0.5 } (should be positive).
I need some help to decide which one to use.
Thanks.
Quite often the degree and/or polarity may depend on the domain and/or the context, so the word alone isn't really enough to make a decision.
If you have some annotated data, I suggest training a classifier on that using the scores provided by the two resources as features. If you don't, one option is to use one of the available sentiment-annotated corpora that matches the domain in question. Without any data at all the whole task becomes somewhat tricky, although there is a substantial body of work on unsupervised approaches to sentiment classification, I believe, see, e.g. Unsupervised Sentiment Analysis
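For illustration only, a minimal sketch of the "train a classifier on the lexicon scores" idea. It uses only SentiWordNet (via NLTK, requiring the 'sentiwordnet' and 'wordnet' corpora) as the feature source, since stock WordNet does not ship polarity scores, and the annotated_data list is a placeholder you would replace with your own labelled words:
from nltk.corpus import sentiwordnet as swn
from sklearn.linear_model import LogisticRegression

def lexicon_features(word):
    # Average the positive and negative scores over all SentiWordNet senses of the word.
    synsets = list(swn.senti_synsets(word))
    if not synsets:
        return [0.0, 0.0]
    pos = sum(s.pos_score() for s in synsets) / len(synsets)
    neg = sum(s.neg_score() for s in synsets) / len(synsets)
    return [pos, neg]

# Placeholder annotated data: (word, 1 = positive, 0 = negative)
annotated_data = [("love", 1), ("wonderful", 1), ("repress", 0), ("awful", 0)]
X = [lexicon_features(word) for word, _ in annotated_data]
y = [label for _, label in annotated_data]
clf = LogisticRegression().fit(X, y)
print(clf.predict([lexicon_features("joy")]))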
There is an interface to give different opinions for SentiWordNet, if you think they are wrong:
http://sentiwordnet.isti.cnr.it/search.php?q=repose
I downloaded the latest WordNet 3.1 and checked the file format documentation, and I don't see any mention of the sentiment numbers you quote. They are also not shown in the online search.
So, for both those reasons I'd suggest going with SentiWordNet!
(I see your question is a year old, so perhaps you can tell us what you did go with, and why?)
The degree of polarity depends not only on the word alone but also on the context of the sentence or phrase.
So if there are different results for the same word, it is because of the difference in context.

Twitter Subjectivity Training Sets

I need a reliable and accurate method to filter tweets as subjective or objective. In other words I need to build a filter in something like Weka using a training set.
Are there any training sets available which could be used as a subjective/objective classifier for Twitter messages or other domains which may be transferable?
For research and non-profit purposes, SentiWordNet gives you exactly what you want. A commercial license is available too.
SentiWordNet : http://sentiwordnet.isti.cnr.it/
Sample Java Code: http://sentiwordnet.isti.cnr.it/code/SWN3.java
Related Paper: http://nmis.isti.cnr.it/sebastiani/Publications/LREC10.pdf
The other approach I would try:
Example
Tweet 1: #xyz u should see the dark knight. Its awesme.
1) First, do a dictionary lookup of the words for their meanings.
"u" and "awesme" will not return anything.
2) Then check against known abbreviations/shorthands and substitute matches with their expansions
(Some resources: netlingo http://www.netlingo.com/acronyms.php or smsdictionary http://www.smsdictionary.co.uk/abbreviations)
Now the original tweet will look like:
Tweet 1: #xyz you should see the dark knight. Its awesme.
3) Then feed the remaining words into a spell checker and substitute them with the best match (not always ideal, and error-prone for short words)
Related Link:
Looking for Java spell checker library
Now the original tweet will look like:
Tweet 1: #xyz you should see the dark knight. Its awesome.
4) Split the tweet into words, feed them into SWN3, and aggregate the results (see the sketch after the caveats below)
The problems with this approach are that
a) Negations should be handled outside SWN3.
b) Information in emoticons and exaggerated punctuations will be lost or they need to be handled separately.
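A minimal sketch of step 4, assuming NLTK's SentiWordNet interface is an acceptable stand-in for SWN3; it simply sums positive minus negative scores over the cleaned-up words and ignores the negation/emoticon caveats above:
from nltk.corpus import sentiwordnet as swn   # requires the 'sentiwordnet' and 'wordnet' corpora

def tweet_polarity(cleaned_tweet):
    score = 0.0
    for word in cleaned_tweet.lower().split():
        senses = list(swn.senti_synsets(word))
        if senses:
            # crude: take the first (most frequent) sense of each word
            score += senses[0].pos_score() - senses[0].neg_score()
    return score

print(tweet_polarity("you should see the dark knight it is awesome"))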
There is sentiment training data at CMU somewhere. I can't remember the link. CMU has done a lot on twitter and sentiment analysis:
From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series
Carnegie Mellon Study of Twitter Sentiments Yields Results Similar to Public Opinion Polls
I wrote an English vs. non-English Naive Bayes classifier for Twitter, made an example dev/test set, and it was 98% accurate. I think that sort of thing is always pretty good if you are just trying to understand the problem, but a package like SentiWordNet might give you a head start.
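This is not that classifier, just a hedged sketch of how an English vs. non-English Naive Bayes filter for tweets could look with scikit-learn, using character n-grams as features; the training lists are placeholders:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Placeholder training data: tweets labelled 1 for English, 0 for not English.
tweets = ["you should see the dark knight", "c'est une tres bonne idee",
          "this movie was awesome", "das war ein schoener Tag"]
labels = [1, 0, 1, 0]

# Character n-grams are a common choice for language identification.
model = make_pipeline(CountVectorizer(analyzer="char_wb", ngram_range=(1, 3)), MultinomialNB())
model.fit(tweets, labels)
print(model.predict(["what a great song"]))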
The problem is defining what makes a tweet subjective or objective! It's important to understand that machine learning is less about the algorithm and more about the quality of the data.
You mention 75% accuracy is all you need.... what about recall? If you provide the right training data you might be able to get that, at the cost of lower recall.
The DynamicLMClassifier in LingPipe works pretty well.
http://alias-i.com/lingpipe/demos/tutorial/sentiment/read-me.html

Understanding Relevance Score of OpenCalais

I am trying to understand what the relevance score that OpenCalais returns with each entity is. What does it signify, and how should it be interpreted? I would be thankful for insights into this.
Their documentation states: The relevance capability detects the importance of each unique entity and assigns a relevance score in the range 0-1 (1 being the most relevant and important).
While they do not explain what 'relevance' means exactly, one would expect it to quantify how central the entity is to the discourse of the document. It is likely influenced by factors such as the entity's mention frequency in the document compared to its expected frequency in a random document (cf. TF-IDF), but it could also involve more sophisticated discourse analysis.
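Purely as an illustration of that TF-IDF-style intuition (not OpenCalais' actual formula, which is not published), a relevance heuristic could compare an entity's mention rate in the document with how rarely it appears across a background corpus:
import math

def relevance(mentions_in_doc, doc_tokens, docs_mentioning_entity, total_docs):
    # TF-IDF-flavoured heuristic: frequent in this document, rare in general.
    tf = mentions_in_doc / doc_tokens
    idf = math.log(total_docs / (1 + docs_mentioning_entity))
    raw = tf * idf
    return raw / (1 + raw)   # squash into [0, 1), loosely mimicking a 0-1 relevance score

print(relevance(mentions_in_doc=12, doc_tokens=800, docs_mentioning_entity=50, total_docs=100000))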
