Seaborn boxplot quartile calculation - python-3.x

I am using seaborn version 0.7.1 for Python. I am trying to create a boxplot for the NumPy array below:
arr = np.array([2, 4, 5, 5, 8, 8, 9])
From my understanding, the quartiles Q1 and Q3 should be 4 and 8, but in the generated boxplot Q1 is approximately 4.5. What am I missing?
I am using the following command to generate the chart:
sns.boxplot(arr)

It would of course depend on the definition of a quartile.
Wikipedia mentions 3 methods to calculate the quartile,
method1: Take median of the lower part of the sample [2,4,5]. Result 4.
method2: Take median of the lower part of the sample (including its median) [2,4,5,5]. Result 4.5.
method3: The lower quartile is 75% of the second data value plus 25% of the third data value. Result: 4*0.75+5*0.25 = 4.25. (It's always the mean of method1 and method2.)
You may also use numpy to calculate the quartiles
x = [2, 4, 5, 5, 8, 8, 9]
np.percentile(x, [25])
This returns 4.5
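The three definitions above can be checked numerically. This is a minimal sketch for this particular sample (already sorted, odd length); the variable names are illustrative only and the slicing would need adjusting for even-length data.

```python
import numpy as np

arr = np.array([2, 4, 5, 5, 8, 8, 9])  # already sorted, odd length
half = len(arr) // 2                   # index of the overall median

# Method 1: median of the lower half, excluding the overall median.
q1_method1 = np.median(arr[:half])        # lower half is [2, 4, 5]
# Method 2: median of the lower half, including the overall median.
q1_method2 = np.median(arr[:half + 1])    # lower half is [2, 4, 5, 5]
# Method 3: mean of methods 1 and 2.
q1_method3 = (q1_method1 + q1_method2) / 2
# NumPy's default: linear interpolation between the closest ranks.
q1_numpy = np.percentile(arr, 25)

print(q1_method1, q1_method2, q1_method3, q1_numpy)  # 4.0 4.5 4.25 4.5
```

For this sample NumPy's default happens to coincide with method 2, which is consistent with the value of about 4.5 read off the boxplot.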

Related

How can I apply a function to multiple columns of grouped rows and make a column of the output?

I have a large excel file containing stock data loaded from and sorted from an API. Below is a sample of dummy data that is applicable towards solving this problem:
I would like to create a function that scores stocks grouped by their Industry. For this sample, a stock should score 5 if its ['Growth'] value is below its group's mean Growth and 10 if it is above the group's mean Growth (other versions would use a quantile or percentile instead of the mean). The scores of all values in a column should be returned in a list.
Current Unfinished Code:
import numpy as np
import pandas as pd
data = pd.DataFrame(pd.read_excel('dummydata.xlsx'))
Desired input:
data['Growth'].apply(score) # Scores stock
Desired Output:
[5, 10, 10, 5, 10, 5]
If I can create a function for this sample then I will be able to make similar ones for different columns with slightly different conditions and aggregates (like percentile or quantile) that affect the score. I'd say the main problem here is accessing these grouped values and comparing them.
I don't think it's possible to convert from a Series to a list in the apply call. I may be wrong on that but if the desired output was changed slightly to
data['Growth'].apply(score).tolist()
then you can use a lambda function to do this.
score = lambda x: 5 if x < data['Growth'].mean() else 10  # compares against the overall mean of the column
data['Growth'].apply(score).tolist() # outputs [5, 10, 10, 5, 10, 5]
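To compare each stock against the mean of its own industry rather than the overall mean, one option is `groupby(...).transform('mean')`. The sketch below is an assumption-heavy illustration: the original spreadsheet is not shown, so the column name `Industry` is taken from the question's wording and the values are made up.

```python
import pandas as pd

# Hypothetical stand-in for dummydata.xlsx; column names and values are assumed.
data = pd.DataFrame({
    "Industry": ["Tech", "Tech", "Tech", "Energy", "Energy", "Energy"],
    "Growth": [0.10, 0.30, 0.25, 0.02, 0.08, 0.03],
})

# transform('mean') returns a Series aligned with the original rows,
# where each row holds the mean Growth of that row's Industry.
group_mean = data.groupby("Industry")["Growth"].transform("mean")

# 5 if below the group mean, otherwise 10.
scores = data["Growth"].lt(group_mean).map({True: 5, False: 10}).tolist()
print(scores)  # [5, 10, 10, 5, 10, 5]
```

Swapping `"mean"` for another aggregate, for example `lambda s: s.quantile(0.75)`, would give the percentile-based variants mentioned in the question.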

Calculate the percent change between every rolling nth row in a Pandas DataFrame

How can I calculate the percentage change between every rolling nth row in a Pandas DataFrame? Using every 2nd row as an example:
Given the following Dataframe:
df = pd.DataFrame({"A": [14, 4, 5, 4, 1, 55],
                   "B": [5, 2, 54, 3, 2, 32],
                   "C": [20, 20, 7, 21, 8, 5],
                   "D": [14, 3, 6, 2, 6, 4]})
I would like the resulting DataFrame to be:
But, the closest I am getting by using this code:
df.iloc[::2, :].pct_change(-1)
Which results in this:
It performs the calculation for every other row, but this is not the same as a rolling window that compares each row with the nth row after it. I came across a similar Stack post, but that example is not very straightforward.
Also, as a bonus, I'd like to display the resulting output as a percentage to two decimal places.
Thank you for your time!
Got it! Use the option "periods" for 'pct_change()'.
df.pct_change(periods=-n)  # where n = 2 for the given example
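A minimal sketch of the `periods` approach on the DataFrame from the question, including one possible way to handle the bonus request of showing the result as a percentage with two decimals. The formatting step is an assumption about what is wanted; it produces strings for display only and leaves `change` numeric.

```python
import pandas as pd

df = pd.DataFrame({"A": [14, 4, 5, 4, 1, 55],
                   "B": [5, 2, 54, 3, 2, 32],
                   "C": [20, 20, 7, 21, 8, 5],
                   "D": [14, 3, 6, 2, 6, 4]})

n = 2
# With a negative period, each row is compared with the row n positions below it:
# change[i] = df[i] / df[i + n] - 1. The last n rows have no partner and are NaN.
change = df.pct_change(periods=-n)

# Display-only copy: two-decimal percentages, empty string where undefined.
formatted = change.apply(
    lambda col: col.map(lambda v: "" if pd.isna(v) else f"{v:.2%}")
)
print(formatted)
```

For example, the first entry of column A is 14 / 5 - 1 = 1.8, which is displayed as 180.00%.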

What does bar plot compute in Y-axis in seaborn?

I am visualizing the titanic dataset. I created 9 different age categories and was trying to visualize the age_categories vs Survived using a bar chart. I wrote the following piece of code:
age_cats = [1, 2, 3, 4, 5, 6, 7, 8, 9]
df_train['Age_Cats'] = pd.cut(df_train['Age'], 9, labels = age_cats)
sns.barplot(x = 'Age_Cats', y = 'Survived', hue = 'Sex', data = df_train)
I do not understand what the numbers on the Y-axis represent.
My assumption is:
{n(Survived = 1)}/{n(Survived = 1) + n(Survived = 0)} or the ratio of people survived out of all people in that category. But how is seaborn calculating it?
Or do the numbers on the Y-axis represent anything else?
The bar plot shows the survival rate or percentage of people who survived.
E.g. in the age class 1 60% of all males survived. In the age class 7 less than 15% of all males survived.
This is calculated by taking the mean of the survival variable for that age class. E.g. if you had 3 people, 2 of whom survived, this variable could look like [1,0,1]; the mean of this array is (1+0+1)/3 ≈ 0.67, so the bar plot would show a bar up to about 0.67.
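Since `sns.barplot` aggregates with the mean by default, the bar heights should match a plain pandas group-by mean. The sketch below uses a tiny made-up frame (not the Titanic data) with the same column names as the question.

```python
import pandas as pd

# Made-up miniature data set; only the column names follow the question.
df_train = pd.DataFrame({
    "Age_Cats": [1, 1, 1, 1, 2, 2],
    "Sex": ["male", "male", "male", "female", "male", "female"],
    "Survived": [1, 0, 1, 1, 0, 1],
})

# Mean of a 0/1 column = share of ones = survival rate per (Age_Cats, Sex) group.
rates = df_train.groupby(["Age_Cats", "Sex"])["Survived"].mean()
print(rates)
```

Here the males in age class 1 have outcomes [1, 0, 1], so their rate is 2/3, the same arithmetic as in the example above.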

np.percentile not equal to quartiles

I'm trying to calculate the quartiles for an array of values in python using numpy.
X = [1, 1, 1, 3, 4, 5, 5, 7, 8, 9, 10, 1000]
I would do the following:
quartiles = np.percentile(X, range(0, 100, 25))
quartiles
# array([1. , 2.5 , 5. , 8.25])
But this is incorrect, as the 1st and 3rd quartiles should be 2 and 8.5, respectively.
This can be shown as the following:
Q1 = np.median(X[:len(X)//2])  # median of the lower half
Q3 = np.median(X[len(X)//2:])  # median of the upper half
Q1, Q3
# (2.0, 8.5)
I can't get my head around what np.percentile is doing to give a different answer. I'd be very grateful for any light shed on this.
There is no right or wrong, just different ways of calculating percentiles. The percentile is a well-defined concept in the continuous case, less so for discrete samples: the choice of method makes little difference for a very large number of observations (compared to the number of duplicates), but it can matter for small samples, and you need to figure out what makes more sense case by case.
To obtain your desired output, specify interpolation = 'midpoint' in the percentile function (NumPy 1.22 and later call this parameter method):
quartiles = np.percentile(X, range(0, 100, 25), interpolation = 'midpoint')
quartiles # array([ 1. , 2. , 5. , 8.5])
I suggest having a look at the docs: http://docs.scipy.org/doc/numpy/reference/generated/numpy.percentile.html
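A short sketch comparing the default (linear) result with the midpoint rule on the same data. The helper `percentile_midpoint` is a hypothetical wrapper, not part of NumPy; it only exists to cope with the parameter being called `method` in newer NumPy releases and `interpolation` in older ones.

```python
import numpy as np

X = [1, 1, 1, 3, 4, 5, 5, 7, 8, 9, 10, 1000]
q = [0, 25, 50, 75]

def percentile_midpoint(values, q):
    """Hypothetical helper: midpoint percentiles across NumPy versions."""
    try:
        return np.percentile(values, q, method="midpoint")         # NumPy >= 1.22
    except TypeError:
        return np.percentile(values, q, interpolation="midpoint")  # older NumPy

linear = np.percentile(X, q)          # default: linear interpolation
midpoint = percentile_midpoint(X, q)  # average of the two neighbouring values

print(linear)
print(midpoint)
```

The 25th percentile falls at fractional position 0.25 * 11 = 2.75, between the values 1 and 3. Linear interpolation gives 1 + 0.75 * 2 = 2.5, while the midpoint rule gives (1 + 3) / 2 = 2.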

Statistics Question

Suppose I conduct a survey of 10 people, asking them to rate a movie from 0 to 4 stars. Allowable answers are 0, 1, 2, 3, and 4.
The mean is 2.0 stars.
How do I calculate the certainty (or uncertainty) about this 2.0 star rating? Ideally, I would like a number between 0 and 1, where 0 represents complete uncertainty and 1 represents complete certainty.
It seems clear that the case where the 10 people choose ( 2, 2, 2, 2, 2, 2, 2, 2, 2, 2 ) would be the most certain, while the case where the 10 people choose ( 0, 0, 0, 0, 0, 4, 4, 4, 4, 4 ) would be the least certain. ( 0, 1, 1, 2, 2, 2, 2, 3, 3, 4 ) would be somewhere in the middle.
The standard deviation does not have the properties requested. It is zero when everyone chooses the same answer, and can be as great as sqrt(40/9) = 2.11 when there are five 0s and five 4s.
I suggest you use 1-stdev(x)/sqrt(40/9) which will take value 1 when everyone agrees, and value 0 when there are five 0s and five 4s.
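The normalised score suggested above can be sketched with the standard library. The function name `certainty` is illustrative, and the constant sqrt(40/9) is specific to 10 respondents on a 0 to 4 scale, as in the question.

```python
import math
import statistics

# Largest possible sample standard deviation for 10 ratings on a 0-4 scale
# (five 0s and five 4s), as stated in the answer.
MAX_STDEV = math.sqrt(40 / 9)

def certainty(ratings):
    """Return 1 - stdev / max_stdev: 1 means full agreement, 0 maximal disagreement."""
    return 1 - statistics.stdev(ratings) / MAX_STDEV

unanimous = certainty([2] * 10)
split = certainty([0] * 5 + [4] * 5)
mixed = certainty([0, 1, 1, 2, 2, 2, 2, 3, 3, 4])

print(unanimous, split, mixed)
```

The three examples from the question come out at roughly 1, 0, and 0.45 respectively (up to floating-point rounding).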
The function you're after here is the standard deviation.
The standard deviations of your three examples are 0 (meaning no deviation), 2.1 (large deviation) and 1.15 (in between).
What you want is called the standard deviation.
You should consider whether or not the mean value is an appropriate statistic for this kind of information, i.e. is a movie rated 4 stars twice as good as one rated 2 stars?
You may be better served by using a percentile measure (such as the median) to represent the central tendency, and a percentile range (such as the IQR) to measure 'certainty'. As in the answers above, certainty would be greatest with a value of 0, as you are really making a measurement of deviation from the central tendency.
Incidentally, a survey of 10 people is too small to perform much in the way of meaningful statistical analysis.
