np.percentile not equal to quartiles - python-3.x

I'm trying to calculate the quartiles for an array of values in python using numpy.
X = [1, 1, 1, 3, 4, 5, 5, 7, 8, 9, 10, 1000]
I would do the following:
quartiles = np.percentile(X, range(0, 100, 25))
quartiles
# array([1. , 2.5 , 5. , 8.25])
But this is incorrect, as the 1st and 3rd quartiles should be 2 and 8.5, respectively.
This can be shown as the following:
Q1 = np.median(X[:len(X)//2])  # integer division; len(X)/2 is a float in Python 3
Q3 = np.median(X[len(X)//2:])
Q1, Q3
# (2.0, 8.5)
I can't get my head round what np.percentile is doing to give a different answer. Any light shed on this would be much appreciated.

There is no right or wrong here, simply different ways of calculating percentiles. The percentile is a well-defined concept in the continuous case, but less so for discrete samples: the different methods make no practical difference for a very large number of observations (compared to the number of duplicates), but they can matter for small samples, and you need to figure out what makes more sense case by case.
To obtain your desired output, specify interpolation='midpoint' in the percentile call (note that NumPy >= 1.22 renames this argument to method):
quartiles = np.percentile(X, range(0, 100, 25), interpolation='midpoint')
quartiles
# array([1. , 2. , 5. , 8.5])
I'd suggest having a look at the docs: http://docs.scipy.org/doc/numpy/reference/generated/numpy.percentile.html
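For reference, a quick sketch comparing a few of NumPy's quartile methods on the same data (on NumPy < 1.22, pass interpolation= instead of method=):
import numpy as np
X = [1, 1, 1, 3, 4, 5, 5, 7, 8, 9, 10, 1000]
# Compare how three common methods place the quartiles.
for how in ('linear', 'lower', 'midpoint'):
    print(how, np.percentile(X, [25, 50, 75], method=how))
# linear   [2.5  5.   8.25]
# lower    [1.   5.   8.  ]
# midpoint [2.   5.   8.5 ]
Only 'midpoint' reproduces the median-of-halves quartiles from the question.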

Related

Way to see the missing columns after calculating the minimum value in Pandas

Using pandas, I grouped a dataset by window number (winnum), latitude, and longitude. The code is as follows:
final = [(win[j], ttdf[0][i], ttdf[1][i], (ttdf[2][i] - shift[j])**2)
         for i in range(len(ttdf)) for j in range(len(ccdf))]
fidf = pd.DataFrame(final)
winnum = fidf[0]
latitude = fidf[1]
longitude = fidf[2]
difference = fidf[3]
titles = {0: 'winnum', 1: 'latitude', 2: 'longitude', 3: 'difference'}
fidf.rename(columns=titles, inplace=True)
Then I summed the difference value of each group to find the minimum value for each set of latitude and longitude.
grouped = fidf['difference'].groupby([fidf['winnum'], fidf['latitude'], fidf['longitude']])
s = grouped.sum()
lastdf = pd.DataFrame(s)
lastdf.min(level='winnum')
However, when I run the code above, I can only see two columns: 'winnum' and the minimum value of the sum of 'difference'.
What I want is to check the (latitude, longitude) pair that has the minimum 'difference' sum for each winnum.
Is there any way I can still see the latitude and longitude columns after I calculate the minimum of the difference sum?
It would be a great help if you could give me the answer. Thanks :)
Your groupby should result in a Series with the sum of the differences. The winnum, latitude and longitude are still present in the index, though.
Example:
import pandas as pd

fidf = pd.DataFrame({'winnum': [0, 0, 1, 2],
                     'latitude': [1, 1, 2, 2],
                     'longitude': [3, 3, 4, 5],
                     'difference': [1, 2, 3, 4]})
grouped = fidf['difference'].groupby([fidf.winnum,
                                      fidf.latitude,
                                      fidf.longitude]).sum()
print(grouped.index.names)
# ['winnum', 'latitude', 'longitude']
You can get the index values for the minimum sum of differences with idxmin:
winnum, lat, long = grouped.idxmin()
# (0, 1, 3)
If you want the row for each winnum with the minimum sum of difference, you can use the following lookup:
grouped.loc[grouped.groupby('winnum').idxmin()]
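On the toy frame above, this lookup keeps the full (winnum, latitude, longitude) index for each per-winnum minimum; a quick sketch of the expected output:
result = grouped.loc[grouped.groupby('winnum').idxmin()]
print(result)
# winnum  latitude  longitude
# 0       1         3            3
# 1       2         4            3
# 2       2         5            4
# Name: difference, dtype: int64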
There's definitely a smarter way of keeping the value; however, why not simply add it back? Once you have lastdf:
lastdf = lastdf.reset_index()
lastdf.merge(fidf,how='left',on=['winnum','difference'])
This should simply take the latitude and longitude of the rows in fidf with the same winnum and difference and append them.
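A hedged sketch of that merge route on the toy frame above; note it merges against the grouped sums rather than the raw fidf rows, so the difference keys line up:
# Rebuild the summed frame, take the per-winnum minimum, then merge lat/lon back.
sums = fidf.groupby(['winnum', 'latitude', 'longitude'])['difference'].sum().reset_index()
lastdf = sums.groupby('winnum', as_index=False)['difference'].min()
print(lastdf.merge(sums, how='left', on=['winnum', 'difference']))
#    winnum  difference  latitude  longitude
# 0       0           3         1          3
# 1       1           3         2          4
# 2       2           4         2          5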

Seaborn boxplot quartile calculation

I am using seaborn version 0.7.1 for Python. I am trying to create a boxplot for the numpy array below:
arr = np.array([2, 4, 5, 5, 8, 8, 9])
From my understanding, the quartiles Q1 and Q3 should be 4 and 8, but in the generated boxplot Q1 is approximately 4.5. What am I missing?
I am using the following command to generate the chart:
sns.boxplot(arr)
It depends, of course, on the definition of a quartile. Wikipedia mentions three methods of calculating it:
Method 1: take the median of the lower half of the sample, [2, 4, 5]. Result: 4.
Method 2: take the median of the lower half of the sample, including its median, [2, 4, 5, 5]. Result: 4.5.
Method 3: the lower quartile is 75% of the second data value plus 25% of the third data value. Result: 4*0.75 + 5*0.25 = 4.25. (Here this is the mean of methods 1 and 2.)
You may also use numpy to calculate the quartiles:
x = [2, 4, 5, 5, 8, 8, 9]
np.percentile(x, [25])
This returns 4.5, which matches the boxplot: seaborn delegates to matplotlib, which computes the box edges with np.percentile and its default linear interpolation.
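For completeness, a small sketch spelling out the three methods on the sorted sample:
import numpy as np
arr = np.array([2, 4, 5, 5, 8, 8, 9])
print(np.median(arr[:3]))             # method 1: lower half without the median -> 4.0
print(np.median(arr[:4]))             # method 2: lower half including the median -> 4.5
print(0.75 * arr[1] + 0.25 * arr[2])  # method 3 -> 4.25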

PySpark columnSimilarities interpretation

I was learning how to use columnSimilarities. Can someone explain to me the matrix that is generated by the algorithm?
Let's say in this code:
from pyspark.mllib.linalg.distributed import RowMatrix
rows = sc.parallelize([(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12)])
# Convert to RowMatrix
mat = RowMatrix(rows)
# Calculate exact and approximate similarities
exact = mat.columnSimilarities()
approx = mat.columnSimilarities(0.05)
# Output
exact.entries.collect()
[MatrixEntry(0, 2, 0.991935352214),
 MatrixEntry(1, 2, 0.998441152599),
 MatrixEntry(0, 1, 0.997463284056)]
How can I know which row is most similar given the matrix? Does (0, 2, 0.991935352214) mean that rows 0 and 2 have a similarity of 0.991935352214? I know that 0 and 2 are i and j, the row and column indices of the matrix, respectively.
Thank you.
How can I know which row is most similar given the matrix?
It is columnSimilarities, not rowSimilarities, so it is just not the thing you're looking for.
You could apply it to the transposed matrix, but you really don't want to. The algorithms used here are designed for tall, optimally sparse data; they just won't scale for wide data.
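To answer the interpretation part: each MatrixEntry(i, j, value) is the cosine similarity of columns i and j. A minimal sketch, continuing from the code above, that picks out the most similar column pair:
# Find the entry with the highest cosine similarity among all column pairs.
best = exact.entries.max(key=lambda e: e.value)
print(best.i, best.j, best.value)
# 1 2 0.998441152599  -> columns 1 and 2 are the most similar pair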

random numbers from geometric distribution such that their sum equals SUM

I want to draw k random numbers i_1, ..., i_k with min <= i <= max from an exponentially shaped distribution of values, with m and std being the median and standard deviation of the population's values. The sum i_1 + ... + i_k should equal a given parameter SUM.
Example:
Given:
k = 9, SUM = 175, min = 8, max = 40, m = 14
Desired:
[9, 10, 11, 12, 14, 17, 23, 30, 39]
I don't know if this is actually possible without relying on luck to draw a combination satisfying the SUM rule. I'd appreciate any kind of help or comment. Thank you.
EDIT: In a former version I wrote about exponential distributions, where an exact solution is impossible; rather, I meant an exponentially shaped distribution with discrete values, like a geometric distribution for instance.
EDIT2: Corrected the number k in the example.
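No answer was posted, but one common approach is to draw from the target distribution, clip into [min, max], and then nudge entries until the total hits SUM. A hedged sketch, not from the thread; the geometric parameter p is an illustration only:
import numpy as np
# Draw k geometric variates shifted to start at lo, clip into [lo, hi],
# then randomly bump entries up or down by 1 until the total equals SUM.
rng = np.random.default_rng(0)
k, SUM, lo, hi = 9, 175, 8, 40
p = 0.15  # geometric parameter, tuned by hand for illustration
draw = np.clip(lo + rng.geometric(p, size=k) - 1, lo, hi)
while draw.sum() != SUM:
    idx = rng.integers(k)
    step = 1 if draw.sum() < SUM else -1
    if lo <= draw[idx] + step <= hi:
        draw[idx] += step
print(sorted(draw), draw.sum())
This works whenever k*min <= SUM <= k*max, though the nudging distorts the distribution's shape somewhat.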

Statistics Question

Suppose I conduct a survey of 10 people, asking them to rate a movie from 0 to 4 stars. Allowable answers are 0, 1, 2, 3, and 4.
The mean is 2.0 stars.
How do I calculate the certainty (or uncertainty) about this 2.0 star rating? Ideally, I would like a number between 0 and 1, where 0 represents complete uncertainty and 1 represents complete certainty.
It seems clear that the case where the 10 people choose ( 2, 2, 2, 2, 2, 2, 2, 2, 2, 2 ) would be the most certain, while the case where the 10 people choose ( 0, 0, 0, 0, 0, 4, 4, 4, 4, 4 ) would be the least certain. ( 0, 1, 1, 2, 2, 2, 2, 3, 3, 4 ) would be somewhere in the middle.
The standard deviation does not have the properties requested: it is zero when everyone chooses the same answer, and can be as great as sqrt(40/9) ≈ 2.11 when there are five 0s and five 4s.
I suggest you use 1 - stdev(x)/sqrt(40/9), which takes the value 1 when everyone agrees and the value 0 when there are five 0s and five 4s.
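A quick sketch of that score, using the sample standard deviation (ddof=1) so the five-0s/five-4s case hits exactly 0:
import numpy as np
def certainty(ratings):
    # 1 at complete agreement, 0 at maximal disagreement (five 0s, five 4s).
    return 1 - np.std(ratings, ddof=1) / np.sqrt(40 / 9)
print(certainty([2] * 10))                        # 1.0
print(certainty([0] * 5 + [4] * 5))               # 0.0
print(certainty([0, 1, 1, 2, 2, 2, 2, 3, 3, 4]))  # ~0.45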
The function you're after here is the standard deviation.
The standard deviations of your three examples are 0 (meaning no deviation), 2.1 (large deviation) and 1.15 (in between).
What you want is called the standard deviation.
You should consider whether or not the mean is an appropriate statistic for this kind of data, i.e. is a movie rated 4 stars twice as good as one rated 2 stars?
You may be better served by a percentile measure (such as the median) to represent the central tendency, and a percentile range (such as the IQR) to measure 'certainty'. As in the answers above, certainty would be greatest at a value of 0, since you are really measuring deviation from the central tendency.
Incidentally, a survey of 10 people is too small to perform much in the way of meaningful statistical analysis.
