What is the formula for sentiment calculation? (NLP)

What is the actual formula to compute sentiment using a sentiment-rated lexicon? The lexicon I am using contains ratings in the range -5 to 5. I want to compute sentiment for individual sentences. Should I compute the average of all sentiment-ranked words in the sentence, or just sum them up?

There are several methods for computing an index from scored sentiment components of sentences. Each is based on comparing positive and negative words, and each has advantages and disadvantages.
For your scale, a measure of the central tendency of the words would be a fair measure, where the denominator is the number of scored words. This is a form of the "relative proportional difference" measure employed below. You would probably not want to divide the total sentiment words' scores by all words, since this makes each sentence's measure strongly affected by non-sentiment terms.
If you do not believe that the 11-point rating you describe is accurate, you could just classify each word as positive or negative depending on its sign, and then apply the following methods, where each P and N refers to the counts of positive and negative coded sentiment words, and O is the count of all other words (so that the total number of words = P + N + O).
Absolute Proportional Difference. Bounds: [-1, 1]
Sentiment = (P − N) / (P + N + O)
Disadvantage: A sentence's score is affected by non-sentiment-related content.
Relative Proportional Difference. Bounds: [-1, 1]
Sentiment = (P − N) / (P + N)
Disadvantage: A sentence's score may tend to cluster very strongly near the scale endpoints (because sentences may contain primarily or exclusively positive words, or primarily or exclusively negative words).
Logit scale. Bounds: [-infinity, +infinity]
Sentiment = log(P + 0.5) - log(N + 0.5)
This tends to have the smoothest properties and is symmetric around zero. The 0.5 is a smoother to prevent log(0).
For details, please see William Lowe, Kenneth Benoit, Slava Mikhaylov, and Michael Laver (2011), "Scaling Policy Preferences from Coded Political Texts," Legislative Studies Quarterly 36(1): 123-155, where we compare the properties of these measures for scaling right-left ideology; everything we discuss there also applies to positive-negative sentiment.
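To make the measures concrete, here is a minimal Python sketch (not from the paper; the toy lexicon and sentence are hypothetical) that computes the three count-based measures alongside the central tendency of the raw -5..5 scores:

import math

lexicon = {"great": 5, "good": 3, "bad": -3, "awful": -4}   # hypothetical -5..5 lexicon

def counts(sentence):
    tokens = sentence.lower().split()
    scores = [lexicon[t] for t in tokens if t in lexicon]
    P = sum(s > 0 for s in scores)   # positive-coded words
    N = sum(s < 0 for s in scores)   # negative-coded words
    O = len(tokens) - P - N          # all other words
    return P, N, O, scores

P, N, O, scores = counts("great food but bad service")

absolute_pd = (P - N) / (P + N + O)                        # affected by O
relative_pd = (P - N) / (P + N) if P + N else 0.0          # clusters near the endpoints
logit = math.log(P + 0.5) - math.log(N + 0.5)              # smooth, symmetric around zero
central = sum(scores) / len(scores) if scores else 0.0     # mean of the raw -5..5 scores

print(absolute_pd, relative_pd, logit, central)            # 0.0 0.0 0.0 1.0

Note how the three count-based measures discard the word weights (they all return 0 for this mixed sentence), while the central-tendency score keeps the information that "great" outweighs "bad".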

You can use R for sentiment computation. Here is a link you can refer to:
https://sites.google.com/site/miningtwitter/questions/sentiment/analysis


The intuition behind the GlobalSpanCoefficient

I am looking at the vehicle routing problem which minimizes the cost of "the slowest truck" in a fleet.
So now the objective function should involve two quantities:
the sum of all transitions of all vehicles (total distance), and
the cost of the most expensive route
How are these values combined? I am assuming that the global span coefficient
distance_dimension.SetGlobalSpanCostCoefficient(100)
is involved? Is that the coefficient of a weighted sum
cost = w*A + (100-w)*B
where A is the cost of the slowest truck and B is the total distance of all trucks?
No, it's simply: cost = B + A
with B = the sum of all edge costs in the routes (usually set by using routing.SetArcCostEvaluatorOfAllVehicles(arc_cost_callback))
and A = w * (max{end} - min{start})
Note: B is needed to help the solver find a good first solution (otherwise a strategy like CHEAPEST_PATH behaves strangely, since there is no cost on the edges for choosing the cheapest...), while A helps to "distribute" jobs by minimizing the max cumul var, but it's still not a real dispersion incentive.
e.g. supposing a dimension with cumul_start = 0 and 4 routes with costs 0,0,6,6, that is as good as 2,2,2,6 (or 6,6,6,6, but B is higher there),
i.e. max(cumul_end) == 6 in both cases.
I added a section on GlobalSpan here in the doc.
ps: take a look at https://github.com/google/or-tools/issues/685#issuecomment-388503464
pps: in the doc example, try changing maximum_distance = 3000 to 1800 or 3500, if I remember well ;)
ppps: note that you can have several GlobalSpan costs on several dimensions; the objective is just the sum of all these costs multiplied by their respective coefficients...
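To make B and A concrete, here is a minimal sketch of a standard OR-tools routing setup in Python (the distance matrix, vehicle count, and dimension capacity are hypothetical, not the asker's model):

from ortools.constraint_solver import pywrapcp, routing_enums_pb2

distance_matrix = [      # hypothetical symmetric distances
    [0, 2, 4, 3],
    [2, 0, 5, 6],
    [4, 5, 0, 1],
    [3, 6, 1, 0],
]
manager = pywrapcp.RoutingIndexManager(len(distance_matrix), 2, 0)   # 2 vehicles, depot 0
routing = pywrapcp.RoutingModel(manager)

def distance_callback(from_index, to_index):
    return distance_matrix[manager.IndexToNode(from_index)][manager.IndexToNode(to_index)]

transit = routing.RegisterTransitCallback(distance_callback)
routing.SetArcCostEvaluatorOfAllVehicles(transit)     # B: sum of arc costs over all routes

routing.AddDimension(transit, 0, 3000, True, 'Distance')   # cumulative distance dimension
distance_dimension = routing.GetDimensionOrDie('Distance')
distance_dimension.SetGlobalSpanCostCoefficient(100)  # A: 100 * (max end cumul - min start cumul)

params = pywrapcp.DefaultRoutingSearchParameters()
params.first_solution_strategy = routing_enums_pb2.FirstSolutionStrategy.PATH_CHEAPEST_ARC
solution = routing.SolveWithParameters(params)
print(solution.ObjectiveValue() if solution else 'no solution')   # B + A for the best solution found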

Why do we calculate cosine similarities using tf-idf weightings?

Suppose we are trying to measure similarity between two very similar documents.
Document A: "a b c d"
Document B: "a b c e"
This corresponds to a term-frequency matrix
a b c d e
A 1 1 1 1 0
B 1 1 1 0 1
where the cosine similarity on the raw vectors is the dot product of the two vectors A and B, divided by the product of their magnitudes:
3/4 = (1*1 + 1*1 + 1*1 + 1*0 + 1*0) / (sqrt(4) * sqrt(4)).
But when we apply an inverse document frequency transformation by multiplying each term in the matrix by log(N / df_i), where N is the number of documents in the matrix (here 2) and df_i is the number of documents in which the term is present, we get a tf-idf matrix of
a b c d e
A: 0 0 0 log2 0
B: 0 0 0 0 log2
Since "a" appears in both documents, it has an inverse-document-frequency value of 0. This is the same for "b" and "c". Meanwhile, "d" is in document A, but not in document B, so it is multiplied by log(2/1). "e" is in document B, but not in document A, so it is also multiplied by log(2/1).
The cosine similarity between these two vectors is 0, suggesting the two are totally different documents. Obviously, this is incorrect. For these two documents to be considered similar to each other using tf-idf weightings, we would need a third document C in the matrix which is vastly different from documents A and B.
Thus, I am wondering whether and/or why we would use tf-idf weightings in combination with a cosine similarity metric to compare highly similar documents. None of the tutorials or StackOverflow questions I've read have been able to answer this question.
This post discusses similar failings with tf-idf weights using cosine similarities, but offers no guidance on what to do about them.
EDIT: as it turns out, the guidance I was looking for was in the comments of that blog post. It recommends using the formula
1 + log ( N / ni + 1)
as the inverse document frequency transformation instead. This would keep the weights of terms which are in every document close to their original weights, while inflating the weights of terms which are not present in a lot of documents by a greater degree. Interesting that this formula is not more prominently found in posts about tf-idf.
Since "a" appears in both documents, it has an inverse-document-frequency value of 0
This is where you have made an error in using inverse document frequency (idf). Idf is meant to be computed over a large collection of documents (not just across two documents); the purpose is to predict the importance of term overlaps in document pairs.
You would expect that common terms, such as 'the', 'a' etc. overlap across all document pairs. Should that be having any contribution to your similarity score? - No.
That is precisely the reason why the vector components are multiplied by the idf factor - just to dampen or boost a particular term overlap (a component of the form a_i*b_i being added to the numerator in the cosine-sim sum).
Now suppose you have a collection of computer science journal articles. Do you believe that an overlap of terms such as 'computer' and 'science' across a document pair is important? - No.
And this will indeed happen because the idf of these terms would be considerably low in this collection.
What do you think will happen if you extend the collection to scientific articles of any discipline? In that collection, the idf value of the word 'computer' will no longer be low. And that makes sense because in this general collection, you would like to think that two documents are similar enough if they are on the same topic - computer science.
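The arithmetic from the question can be reproduced with a minimal plain-Python sketch. The smoothed variant below reads the quoted formula as idf = 1 + log(N / (n_i + 1)); that parenthesization is my assumption:

import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [["a", "b", "c", "d"], ["a", "b", "c", "e"]]
vocab = sorted({t for d in docs for t in d})
N = len(docs)
df = {t: sum(t in d for d in docs) for t in vocab}          # document frequencies
tf = [[d.count(t) for t in vocab] for d in docs]            # raw term counts

idf_standard = {t: math.log(N / df[t]) for t in vocab}      # shared terms get weight 0
idf_smoothed = {t: 1 + math.log(N / (df[t] + 1)) for t in vocab}

def weigh(rows, idf):
    return [[cnt * idf[t] for cnt, t in zip(row, vocab)] for row in rows]

print(cosine(tf[0], tf[1]))              # raw counts: 0.75
print(cosine(*weigh(tf, idf_standard)))  # standard tf-idf: 0.0
print(cosine(*weigh(tf, idf_smoothed)))  # smoothed idf: about 0.51

With a larger collection, the standard idf of terms that appear in nearly every document approaches (rather than equals) zero, which is exactly the dampening effect described in the answer above.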

Meaning of the V value, Wilcoxon signed rank test

I have a question about my results of the Wilcoxon signed rank test:
My data come from a trial with 2 paired groups in which a treatment was used. The results were scored in %. The groups consist of 131 people.
When I run the test in R, I got the following result:
wilcox.test(no.treatment, with.treatment, paired=T)
# Wilcoxon signed rank test with continuity correction
# data: no.treatment and with.treatment V = 3832, p-value = 0.7958
# alternative hypothesis: true location shift is not equal to 0
I am wondering what the V value means. I read somewhere that it has something to do with the number of positive scores (?), but I am wondering if it could tell me anything about the data and interpretation?
I'll give a little bit of background before answering your question.
The Wilcoxon signed rank test compares two values measured on the same N people (here 131); for example, blood values measured for the same 131 people at two time points. The purpose of the test is to see whether the blood values have changed.
The V statistic you are getting does not have a direct interpretation. It is based on the pairwise differences between the two measurements for each individual, and it is the value of a variable that follows a certain probability distribution under the null hypothesis. Intuitively speaking, the further V is from its expected value under the null, the larger the difference between the two sets of measurements.
As always in hypothesis testing, you (well, the wilcox.test function) calculate the probability of observing a value of V at least as extreme as 3832:
prob('observing a value at least as extreme as 3832, when the groups are actually the same')
If there is really no difference between the two groups, V will be close to its expected value under the null, n(n+1)/4 (about 4323 for n = 131). Whether the V you see is 'close' to that depends on the probability distribution, which is not straightforward for this statistic; luckily that doesn't matter, since wilcox.test knows the distribution and calculates the probability for you (0.7958).
In short
Your groups do not significantly differ and V doesn't have a clear interpretation.
The V statistic produced by the function wilcox.test() can be calculated in R as follows:
# Create random data between -0.5 and 0.5
da_ta <- runif(1e3, min=-0.5, max=0.5)
# Perform Wilcoxon test using function wilcox.test()
wilcox.test(da_ta)
# Calculate the V statistic produced by wilcox.test()
sum(rank(abs(da_ta))[da_ta > 0])
The user MrFlick provided the above answer in reply to this question:
How to get same results of Wilcoxon sign rank test in R and SAS.
The Wilcoxon W statistic is not the same as the V statistic, and can be calculated in R as follows:
# Calculate the Wilcoxon W statistic
sum(sign(da_ta) * rank(abs(da_ta)))
The above statistic can be compared with the Wilcoxon probability distribution to obtain the p-value. There is no simple formula for the Wilcoxon distribution, but it can be simulated using Monte Carlo simulation.
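As a quick illustration of that simulation idea, here is a small numpy sketch (an illustration in Python, not the R code above) of the null distribution of W = sum(sign * rank), using n = 15 pairs and the observed W = 80 - 40 = 40 from the worked example further down:

import numpy as np

rng = np.random.default_rng(0)
n, reps = 15, 100_000
ranks = np.arange(1, n + 1)

# Under the null hypothesis, each rank carries a + or - sign with probability 1/2.
signs = rng.choice([-1, 1], size=(reps, n))
W = (signs * ranks).sum(axis=1)

observed = 80 - 40
p_value = np.mean(np.abs(W) >= abs(observed))   # two-sided Monte Carlo p-value
print(p_value)   # well above 0.05, consistent with the critical interval [25, 95] discussed below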
The value of V is not the number of positive scores but the sum of the ranks assigned to the positive differences.
There is likewise a sum for the negative ranks, which this test's output does not report. A brief script for calculating the sum of positive and negative ranks is provided in the following example:
a <- c(214, 159, 169, 202, 103, 119, 200, 109, 132, 142, 194, 104, 219, 119, 234)
b <- c(159, 135, 141, 101, 102, 168, 62, 167, 174, 159, 66, 118, 181, 171, 112)
diff <- c(a - b) #calculating the vector containing the differences
diff <- diff[ diff!=0 ] #delete all differences equal to zero
diff.rank <- rank(abs(diff)) #check the ranks of the differences, taken in absolute
diff.rank.sign <- diff.rank * sign(diff) #check the sign to the ranks, recalling the signs of the values of the differences
ranks.pos <- sum(diff.rank.sign[diff.rank.sign > 0]) #calculating the sum of ranks assigned to the differences as a positive, ie greater than zero
ranks.neg <- -sum(diff.rank.sign[diff.rank.sign < 0]) #calculating the sum of ranks assigned to the differences as a negative, ie less than zero
ranks.pos #it is the value V of the wilcoxon signed rank test
[1] 80
ranks.neg
[1] 40
CREDITS: https://www.r-bloggers.com/wilcoxon-signed-rank-test/
(They also provide a nice context for it.)
You can also compare both of these numbers to their average (in this case, 60), which would be the expected value for each side: positive ranks summing to 60 and negative ranks summing to 60 would mean complete equivalence of the two sides. Can positive ranks summing to 80 and negative ranks summing to 40 still be considered equivalent? (i.e., can we attribute this difference of 20 to chance, or is it large enough for us to reject the hypothesis of no difference?)
As they explain, the critical interval for this case is [25, 95]. Checking a table of critical values for the Wilcoxon signed rank test, the critical value for this example is 25 (15 pairs at 5% on a two-tailed test; and 120 - 25 = 95). This means that the observed sums, 40 and 80, stay inside [25, 95], so we cannot discard the possibility that the differences are purely due to random sampling. (Consistently, the p-value is above the alpha.)
Comparing the sum of positive ranks to the sum of negative ranks helps to determine the significance of the difference, so it enriches the analysis. Also, the positive ranks themselves are input for the calculation of the test's p-value, hence the interest in them.
But extracting meaning from a reported sum of positive ranks (V) alone is not straightforward. To get a more consistent idea of what is happening, the least you should do is also check the sum of the negative ranks (along with general information such as the sample size, the p-value, etc.).
I, too, was confused about this seemingly mysterious "V" statistic. I realize there are already some helpful answers here, but I did not really understand them when I first read them, so here I am explaining it again in a way that I finally understood. Hopefully it helps others who are still confused.
The V statistic is the sum of the ranks assigned to the differences with positive signs. That is, when you run a Wilcoxon signed rank test, it calculates a sum of negative ranks (W-) and a sum of positive ranks (W+). The test statistic (W) is usually the minimum of (W-) and (W+), whereas the V statistic is simply (W+).
To understand why this matters: if the null hypothesis is true, (W+) and (W-) should be similar. This is because, for a given number of samples (n), (W+) and (W-) always add up to a fixed total, (W+) + (W-) = n(n+1)/2. If this total is split roughly evenly, then there is not much of a difference between the paired sample sets and we fail to reject the null. If there is a large difference between (W+) and (W-), then there is a large difference between the paired sample sets, and we have evidence for the alternative hypothesis. The size of the difference and its significance relate to the critical value chart for W. A quick numerical check of this bookkeeping follows below.
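Here is a minimal check of that bookkeeping in Python (not R), reusing the a and b vectors from the R example earlier in this thread:

a = [214, 159, 169, 202, 103, 119, 200, 109, 132, 142, 194, 104, 219, 119, 234]
b = [159, 135, 141, 101, 102, 168, 62, 167, 174, 159, 66, 118, 181, 171, 112]

diffs = [x - y for x, y in zip(a, b) if x != y]   # drop zero differences
n = len(diffs)

# Rank the absolute differences (average ranks in case of ties).
order = sorted(range(n), key=lambda i: abs(diffs[i]))
ranks = [0.0] * n
i = 0
while i < n:
    j = i
    while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
        j += 1
    avg_rank = (i + j) / 2 + 1            # average rank for a tied block
    for k in range(i, j + 1):
        ranks[order[k]] = avg_rank
    i = j + 1

w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)    # the V reported by R
w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)

print(w_plus, w_minus)                     # 80.0 40.0 for these data
print(w_plus + w_minus, n * (n + 1) / 2)   # both equal 120.0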
Here are particularly helpful sites to check out if the concept is still not 100%:
1.) https://mathcracker.com/wilcoxon-signed-ranks
2.) https://sphweb.bumc.bu.edu/otlt/mph-modules/bs/bs704_nonparametric/BS704_Nonparametric6.html
3.) https://www.youtube.com/watch?v=TqCg2tb4wJ0
TLDR; the V-statistic reported by R is the same as the W-statistic in cases where (W+) is the smaller of (W+) or (W-).

Formula for amplitude using FFT

I want to ask about the formula for amplitude below. I am using the Fast Fourier Transform, so it returns complex numbers (real and imaginary parts).
After that I must find the amplitude for each frequency.
My formula is
amplitude = 10 * log (real*real + imagined*imagined)
I want to ask about this formula. What is its source? I have searched, but I haven't found any source. Can anybody tell me about the source?
This is a combination of two equations:
1: Finding the magnitude of a complex number (the result of an FFT at a particular bin) - the equation for which is
m = sqrt(r^2 + i ^2)
2: Calculating relative power in decibels from an amplitude value - the equation for which is p = 10 * log10(A^2 / Aref^2) == 20 * log10(A / Aref), where Aref is some reference value.
By inserting m from equation 1 as A in equation 2, with Aref = 1, we get:
p = 10 * log10(r^2 + i^2)
Note that this gives you a measure of relative signal power rather than amplitude.
The first part of the formula likely comes from the definition of the decibel, with the reference P0 set to 1, assuming that by log you mean a base-10 logarithm.
The second part, i.e. the P1=real^2 + imagined^2 in the link above, is the square of the modulus of the Fourier coefficient cn at the n-th frequency you are considering.
A Fourier coefficient is in general a complex number (See its definition in the case of a DFT here), and P1 is by definition the square of its modulus. The FFT that you mention is just one way of calculating the DFT. In your case, likely the real and complex numbers you refer to are actually the real and imaginary parts of this coefficient cn.
sqrt(P1) is the modulus of the Fourier coefficient cn of the signal at the n-th frequency.
sqrt(P1)/N is the amplitude of the Fourier component of the signal at the n-th frequency (i.e. the amplitude of the harmonic component of the signal at that frequency), with N being the number of samples in your signal. To convince yourself that you need to divide by N, see this equation. However, the division factor depends on the definition/convention of the Fourier transform that you use; see the note just above here, and the discussion here.
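To make both steps concrete, here is a small numpy sketch (the test signal, sampling rate, and the one-sided 2/N amplitude scaling are my assumptions, not the asker's code):

import numpy as np

fs = 1000                                  # sampling rate in Hz (hypothetical)
t = np.arange(0, 1, 1 / fs)                # 1 second of samples
x = 0.8 * np.sin(2 * np.pi * 50 * t)       # 50 Hz tone with amplitude 0.8

X = np.fft.rfft(x)                         # complex FFT coefficients (one-sided)
N = len(x)
freqs = np.fft.rfftfreq(N, d=1 / fs)

k = np.argmax(np.abs(X))                   # bin with the strongest component
real, imag = X[k].real, X[k].imag

power_db = 10 * np.log10(real**2 + imag**2)       # the asker's formula: relative power in dB
amplitude = 2 * np.sqrt(real**2 + imag**2) / N    # factor 2 for the one-sided spectrum

print(freqs[k], amplitude, power_db)       # roughly 50.0 Hz, 0.8, and 52 dB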

Can the cosine similarity when using Locality Sensitive Hashing be -1?

I was reading this question:
How to understand Locality Sensitive Hashing?
But then I found that the equation to calculate the cosine similarity is as follows:
Cos(v1, v2) = cos(theta), where theta = (hamming distance / signature length) * pi = (h/b) * pi
Which means if the vectors are fully similar, then the hamming distance will be zero and the cosine value will be 1. But when the vectors are totally not similar, then the hamming distance will be equal to the signature length and so we have cos(pi) which will result in -1. Shouldn't the similarity be always between 0 and 1?
Cosine similarity is the dot product of the vectors divided by the product of their magnitudes, so it's entirely possible to have a negative value for the angle's cosine. For example, if you have unit vectors pointing in opposite directions, then you want the value to be -1. I think what's confusing you is the nature of the representation: the other post is talking about angles between vectors in 2-D space, whereas it's more common to create vectors in a multidimensional space where the number of dimensions is customarily much greater than 2 and the value of each dimension is non-negative (e.g., a word occurs in a document or not), which results in a 0 to 1 range.
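To see the negative case numerically, here is a small numpy sketch of random-hyperplane signatures (the vectors, dimensionality, and signature length are hypothetical):

import numpy as np

rng = np.random.default_rng(0)
dim, signature_length = 64, 1024

v1 = rng.normal(size=dim)
v2 = -v1 + 0.05 * rng.normal(size=dim)     # roughly opposite direction to v1

planes = rng.normal(size=(signature_length, dim))   # random hyperplanes
sig1 = planes @ v1 > 0                     # one sign bit per hyperplane
sig2 = planes @ v2 > 0

hamming = np.count_nonzero(sig1 != sig2)
theta = (hamming / signature_length) * np.pi
estimated = np.cos(theta)                  # cosine recovered from the signatures

true = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(estimated, true)                     # both close to -1

When the inputs are non-negative (e.g. term counts), no pair of vectors can point in opposite directions, which is why the similarity then stays in the 0 to 1 range.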
