Statistics Question - statistics

Suppose I survey 10 people, asking each to rate a movie from 0 to 4 stars. Allowable answers are 0, 1, 2, 3, and 4.
The mean is 2.0 stars.
How do I calculate the certainty (or uncertainty) about this 2.0 star rating? Ideally, I would like a number between 0 and 1, where 0 represents complete uncertainty and 1 represents complete certainty.
It seems clear that the case where the 10 people choose ( 2, 2, 2, 2, 2, 2, 2, 2, 2, 2 ) would be the most certain, while the case where the 10 people choose ( 0, 0, 0, 0, 0, 4, 4, 4, 4, 4 ) would be the least certain. ( 0, 1, 1, 2, 2, 2, 2, 3, 3, 4 ) would be somewhere in the middle.

The standard deviation does not have the properties requested. It is zero when everyone chooses the same answer, and can be as great as sqrt(40/9) = 2.11 when there are five 0s and five 4s.
I suggest you use 1-stdev(x)/sqrt(40/9) which will take value 1 when everyone agrees, and value 0 when there are five 0s and five 4s.
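As a rough illustration, the normalization suggested above can be sketched in Python with the standard library. The function name `certainty` and its `lo`/`hi` parameters are my own, not part of any library. The worst case is built as a half-low, half-high split, which gives sqrt(40/9) for 10 people on a 0 to 4 scale.

```python
import statistics

def certainty(ratings, lo=0, hi=4):
    """Return 1 - stdev / max_stdev (hypothetical helper for this example)."""
    n = len(ratings)
    # Worst case: half the votes at `lo`, the other half at `hi`.
    worst = [lo] * (n // 2) + [hi] * (n - n // 2)
    max_sd = statistics.stdev(worst)
    return 1 - statistics.stdev(ratings) / max_sd

print(certainty([2] * 10))                        # everyone agrees
print(certainty([0] * 5 + [4] * 5))               # maximal disagreement
print(certainty([0, 1, 1, 2, 2, 2, 2, 3, 3, 4]))  # somewhere in between
```

The three calls correspond to the three examples in the question and give 1, 0, and roughly 0.45.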

The function you're after here is the standard deviation.
The standard deviations of your three examples are 0 (meaning no deviation), 2.1 (large deviation) and 1.15 (in between).

What you want is called the standard deviation.

You should consider whether or not the mean value is an appropriate statistic for this kind of information, i.e. is a movie rated 4 stars twice as good as one rated 2 stars?
You may be better served by using a percentile measure (such as the median) to represent the central tendency, and a percentile range (such as the IQR) to measure 'certainty'. As in the answers above, certainty would be greatest with a value of 0, as you are really making a measurement of deviation from the central tendency.
Incidentally, a survey of 10 people is too small to perform much in the way of meaningful statistical analysis.
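A minimal sketch of the median/IQR idea, assuming Python 3.8+ for `statistics.quantiles`. The helper name `median_and_iqr` is made up for this example, and the "inclusive" quartile method is just one of several possible definitions.

```python
import statistics

def median_and_iqr(ratings):
    """Return (median, interquartile range) using inclusive quartiles."""
    q1, _, q3 = statistics.quantiles(ratings, n=4, method="inclusive")
    return statistics.median(ratings), q3 - q1

# The three example surveys from the question.
for sample in ([2] * 10, [0] * 5 + [4] * 5, [0, 1, 1, 2, 2, 2, 2, 3, 3, 4]):
    print(median_and_iqr(sample))
```

All three samples have a median of 2, while the IQR grows from 0 (full agreement) to 4 (votes split between 0 and 4), with 1.5 for the mixed sample.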

Related

Flatten data by count

I have some data like this:
# of jobs count
--------- -----
1 2
2 3
3 1
4 1
They represent a multiset {1, 1, 2, 2, 2, 3, 4}. I want to compute statistics (median, mean, mode) on the multiset. Is there any way to generate this list from the 4x2 table above? That way I can use the built-in functions (MEDIAN/AVERAGE/MODE). Currently I'm using SUMPRODUCT to calculate the mean, and a combination of MAX/MATCH/INDEX to get the mode, but I can't figure out a way to calculate the median.
Note:
Of course the real data is much more than 4 rows, but the idea should be the same.
The first column is sorted integers, if that helps.
It’s OK to use some auxiliary cells to hold intermediate data.
It doesn’t have to be a formula; if pivot table is a better tool, please advise.

With access to CONCAT you could use:
=FILTERXML(CONCAT("<t><s>",REPT(A2:A5&"</s><s>",B2:B5),"</s></t>"),"//s[node()]")
This would return {1, 1, 2, 2, 2, 3, 4} and you could directly apply the other functions, e.g.:
=MEDIAN(FILTERXML.....) etc.
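For comparison outside Excel, the same flattening idea can be sketched in Python: repeat each value by its count, then apply the ordinary statistics functions. The variable names `rows` and `flat` are illustrative only.

```python
import statistics

# (value, count) pairs from the table in the question.
rows = [(1, 2), (2, 3), (3, 1), (4, 1)]

# Repeat each value `count` times to rebuild the multiset.
flat = [value for value, count in rows for _ in range(count)]

print(flat)
print(statistics.median(flat), statistics.mean(flat), statistics.mode(flat))
```

This rebuilds {1, 1, 2, 2, 2, 3, 4}, whose median and mode are both 2 and whose mean is 15/7.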

Seaborn boxplot quartile calculation

I am using seaborn version 0.7.1 for Python. I am trying to create a boxplot for the NumPy array below:
arr = np.array([2, 4, 5, 5, 8, 8, 9])
From my understanding, the quartiles Q1 and Q3 should be 4 and 8, but in the generated boxplot Q1 is approximately 4.5. What am I missing?
I am using the following command to generate the chart:
sns.boxplot(arr)

It would of course depend on the definition of a quartile.
Wikipedia mentions 3 methods to calculate the quartile,
method1: Take median of the lower part of the sample [2,4,5]. Result 4.
method2: Take median of the lower part of the sample (including its median) [2,4,5,5]. Result 4.5.
method3: The lower quartile is 75% of the second data value plus 25% of the third data value. Result: 4*0.75+5*0.25 = 4.25. (It's always the mean of the method1 and method2 results.)
You may also use numpy to calculate the quartiles
import numpy as np
x = [2, 4, 5, 5, 8, 8, 9]
np.percentile(x, [25])
This returns 4.5
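The three definitions can be compared side by side with a small standard-library sketch. The helper `lower_quartile` and its `include_median` flag are hypothetical names chosen for this example. Method 3 is computed here as the mean of the other two, as noted above.

```python
import statistics

def lower_quartile(data, include_median):
    """Q1 as the median of the lower half (method 1 or method 2)."""
    s = sorted(data)
    half = len(s) // 2
    # For an odd-sized sample, method 2 keeps the overall median in the lower half.
    if include_median and len(s) % 2 == 1:
        half += 1
    return statistics.median(s[:half])

arr = [2, 4, 5, 5, 8, 8, 9]
m1 = lower_quartile(arr, include_median=False)  # median of [2, 4, 5]
m2 = lower_quartile(arr, include_median=True)   # median of [2, 4, 5, 5]
m3 = (m1 + m2) / 2                              # mean of methods 1 and 2
print(m1, m2, m3)
```

For this sample the results are 4, 4.5, and 4.25. The standard library's `statistics.quantiles(arr, n=4)` (Python 3.8+) gives 4 for Q1 with its default "exclusive" method and 4.5 with `method="inclusive"`, which shows how easily two tools can disagree on small samples.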

np.percentile not equal to quartiles

I'm trying to calculate the quartiles for an array of values in Python using NumPy.
X = [1, 1, 1, 3, 4, 5, 5, 7, 8, 9, 10, 1000]
I would do the following:
quartiles = np.percentile(X, range(0, 100, 25))
quartiles
# array([1. , 2.5 , 5. , 8.25])
But this is incorrect, as the 1st and 3rd quartiles should be 2 and 8.5, respectively.
This can be shown as the following:
Q1 = np.median(X[:len(X)//2])  # median of the lower half
Q3 = np.median(X[len(X)//2:])  # median of the upper half
Q1, Q3
# (2.0, 8.5)
I can't get my head around what np.percentile is doing to give a different answer. I'd be very grateful for any light shed on this.

There is no right or wrong, simply different ways of calculating percentiles. The percentile is a well-defined concept in the continuous case, less so for discrete samples. The different methods make little difference for a very large number of observations (compared to the number of duplicates), but they can matter for small samples, and you need to figure out what makes the most sense case by case.
To obtain your desired output, you should specify interpolation = 'midpoint' in the percentile function:
quartiles = np.percentile(X, range(0, 100, 25), interpolation = 'midpoint')
quartiles # array([ 1. , 2. , 5. , 8.5])
I'd suggest having a look at the docs: http://docs.scipy.org/doc/numpy/reference/generated/numpy.percentile.html
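To see where the two sets of numbers come from, here is an illustrative pure-Python reimplementation of the "linear" and "midpoint" rules. It is a sketch of the idea, not NumPy's actual code, and the function name `percentile` is my own. As far as I know, NumPy 1.22 and later spell the keyword `method=` rather than `interpolation=`.

```python
import math

def percentile(data, q, interpolation="linear"):
    """Illustrative percentile at sorted position (n - 1) * q / 100."""
    s = sorted(data)
    pos = (len(s) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    if interpolation == "midpoint":
        # Average the two neighbouring data points, ignoring the fraction.
        return (s[lo] + s[hi]) / 2
    # "linear": weight the two neighbours by the fractional part of pos.
    return s[lo] + (pos - lo) * (s[hi] - s[lo])

X = [1, 1, 1, 3, 4, 5, 5, 7, 8, 9, 10, 1000]
print([percentile(X, q) for q in range(0, 100, 25)])
print([percentile(X, q, "midpoint") for q in range(0, 100, 25)])
```

The first line reproduces 1, 2.5, 5, 8.25 from the question and the second reproduces 1, 2, 5, 8.5. The 25th percentile sits at position 2.75, between the values 1 and 3, so linear interpolation gives 2.5 and the midpoint rule gives 2.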

Excel giving wrong average

Hi I have an average function:
=IF(ISERROR(AVERAGE(H6:H31)), "", AVERAGE(H6:H31))
but it returns the wrong average for the numbers: 0, 0, 3, 0, 0, 0, 0, 0, 4, 0
It produces 0.7 instead of 3.5 and I am definitely using column H row 6 to 31
What could cause this? Thanks

0.7 is the correct answer.
You are looking for the average excluding zeros, in which case you should use the AVERAGEIF function. In your case that would be:
=AVERAGEIF(H6:H31,"<>0")
This will give you 3.5
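The difference between the two averages is easy to check by hand. This short Python sketch uses the ten values from the question, with variable names that are illustrative only.

```python
values = [0, 0, 3, 0, 0, 0, 0, 0, 4, 0]

# Plain mean over all ten values, which is what AVERAGE computes.
plain_mean = sum(values) / len(values)

# Mean over the non-zero values only, which is what AVERAGEIF(..., "<>0") computes.
nonzero = [v for v in values if v != 0]
nonzero_mean = sum(nonzero) / len(nonzero)

print(plain_mean, nonzero_mean)
```

The plain mean is 7/10 = 0.7, while the mean of the two non-zero values is 7/2 = 3.5.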

The average is defined as
grand total / total number of observations.
In this case you have a total of 10 observations that sum to 7,
so 7/10 is 0.7.

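As said above, Excel is correct. This should get you what you want though:
=SUMIF(H6:H31, "<>0")/COUNTIF(H6:H31, "<>0")
Note that COUNTIF with "<>0" also counts empty cells, so this only matches AVERAGEIF when H6:H31 contains no blanks.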

find the longest increasing subsequence (LIS)

Given A = {1,4,2,9,7,5,8,2}, find the LIS. Show the filled dynamic programming table and how the solution is found.
My book doesn't cover LIS, so I'm a bit lost on how to start. For the DP table, I've done something similar with Longest Common Subsequences. Any help on how to start this would be much appreciated.

There are already plenty of answers on this topic, but here's my walkthrough. I view this site as a repository of answers for posterity, and this is just to share the additional insight I gained when I worked through it myself.
The Longest Increasing Subsequence (LIS) problem is to find the length of the longest subsequence of a given sequence such that all elements of the
subsequence are sorted in increasing order. For example, the length of the LIS for
{ 10, 22, 9, 33, 21, 50, 41, 60, 80 } is 6 and an LIS is {10, 22, 33, 50, 60, 80}.
Let S[pos] be defined as the smallest integer that ends an increasing sequence of length pos.
Now iterate through every integer X of the input set and do the following:
If X > last element in S, then append X to the end of S. This essentially means we have found a new, longer increasing subsequence.
Otherwise find the smallest element in S that is >= X and change it to X. Because S is sorted at all times, the element can be found
using binary search in O(log N).
Total runtime: N integers and a binary search for each of them, so N * log(N) = O(N log N).
Now let's do a real example:
Set of integers: 2 6 3 4 1 2 9 5 8
Steps:
0. S = {} - Initialize S to the empty set
1. S = {2} - New largest LIS
2. S = {2, 6} - 6 > 2 so append that to S
3. S = {2, 3} - 6 is the smallest element > 3 so replace 6 with 3
4. S = {2, 3, 4} - 4 > 3 so append that to S
5. S = {1, 3, 4} - 2 is the smallest element > 1 so replace 2 with 1
6. S = {1, 2, 4} - 3 is the smallest element > 2 so replace 3 with 2
7. S = {1, 2, 4, 9} - 9 > 4 so append that to S
8. S = {1, 2, 4, 5} - 9 is the smallest element > 5 replace 9 with 5
9. S = {1, 2, 4, 5, 8} - 8 > 5 so append that to S
So the length of the LIS is 5 (the size of S).
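The steps above can be sketched in Python with the standard `bisect` module. The function name `lis_tails` is my own choice. `bisect_left` returns the position of the smallest element >= X, which is exactly the element the description says to replace.

```python
from bisect import bisect_left

def lis_tails(seq):
    """Return S, where S[k] is the smallest value ending an increasing run of length k + 1."""
    s = []
    for x in seq:
        i = bisect_left(s, x)  # index of the smallest element >= x
        if i == len(s):
            s.append(x)        # x is larger than everything in S: extend
        else:
            s[i] = x           # otherwise replace that element with x
    return s

s = lis_tails([2, 6, 3, 4, 1, 2, 9, 5, 8])
print(s, len(s))
```

This ends with S = [1, 2, 4, 5, 8], matching step 9. Note that S itself is not necessarily a valid subsequence of the input (here 4 appears before 1 and 2 in the original order). Only its length is meaningful, and recovering an actual LIS needs extra bookkeeping such as predecessor links.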
Let's take some other sequences to check the edge cases, since each presents its own issue.
Say we have 1,2,3,4,9,2,3,4,5,6,7,8,10.
It builds out 1,2,3,4,9 first. The second 2, 3 and 4 each replace the equal element already in S, so S does not change. Then 5 replaces 9, and 6,7,8,10 are appended,
so S will look like 1,2,3,4,5,6,7,8,10.
Take another case: 1,2,3,4,5,9,2,10.
This gives us 1,2,3,4,5,9,10.
Or take the case 1,2,3,4,5,9,6,7,8,10.
This gives us 1,2,3,4,5,6,7,8,10.
That illuminates what goes on. In the first case the critical juncture is what happens when you hit the 2 after the 9.
The block of 2,3,4 doesn't change anything. When you hit 5 you replace the 9, because 5 and 9
are interchangeable as the end of an increasing run of length 5, and 5 is smaller, so there
is greater potential to hit something > 5 later on. But you only replace the smallest element >= the new value. For example, in the last case,
if 6 replaced 1 instead of 9, 7 replaced 2, and 8 replaced 3, the final array would have 7 elements instead
of 9. So work through a couple of these by hand to see the pattern; this logic isn't the easiest to put on paper.

There's a very strong relation between LIS and LCS.
http://en.wikipedia.org/wiki/Longest_increasing_subsequence
This article explains it pretty well I think. Basically the idea is, you can reduce one problem to the other (this is the case in many situations involving Dynamic programming).
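Since the question mentions familiarity with LCS tables, here is a minimal sketch of that reduction: the length of the strictly increasing LIS of A equals the LCS length of A and the sorted list of A's distinct values. The function name `lcs_table` is my own. This O(N^2) version is for illustration and is not the fastest approach.

```python
def lcs_table(a, b):
    """Classic LCS DP: t[i][j] is the LCS length of a[:i] and b[:j]."""
    t = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                t[i][j] = t[i - 1][j - 1] + 1
            else:
                t[i][j] = max(t[i - 1][j], t[i][j - 1])
    return t

A = [1, 4, 2, 9, 7, 5, 8, 2]
B = sorted(set(A))  # distinct values in increasing order (strict LIS)
table = lcs_table(A, B)
for row in table:
    print(row)      # the filled DP table, one row per prefix of A
print("LIS length:", table[-1][-1])
```

For the A in the question the bottom-right cell is 4, corresponding for example to 1, 2, 5, 8. An actual subsequence can be recovered by tracing back from that cell in the usual LCS manner.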
