I am visualizing the titanic dataset. I created 9 different age categories and was trying to visualize the age_categories vs Survived using a bar chart. I wrote the following piece of code:
age_cats = [1, 2, 3, 4, 5, 6, 7, 8, 9]
df_train['Age_Cats'] = pd.cut(df_train['Age'], 9, labels = age_cats)
sns.barplot(x = 'Age_Cats', y = 'Survived', hue = 'Sex', data = df_train)
What do the numbers on the Y-axis represent?
My assumption is:
{n(Survived = 1)}/{n(Survived = 1) + n(Survived = 0)} or the ratio of people survived out of all people in that category. But how is seaborn calculating it?
Or do the numbers on the Y-axis represent anything else?
The bar plot shows the survival rate or percentage of people who survived.
E.g. in age class 1, 60% of all males survived. In age class 7, fewer than 15% of all males survived.
This is calculated by taking the mean of the Survived variable within each age class (the mean is seaborn's default estimator for barplot). E.g. if you had 3 people, 2 of whom survived, this variable could look like [1,0,1]; the mean of this array is (1+0+1)/3 ≈ 0.67, so the bar plot would show a bar up to about 0.67.
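To check this yourself, you can compute the same means with a pandas groupby and compare them with the bar heights. The sketch below uses a small made-up frame in place of df_train (the rows are invented for illustration; only the column names come from the question):

```python
import pandas as pd

# Hypothetical toy data standing in for df_train (not the real Titanic rows)
df = pd.DataFrame({
    "Age_Cats": [1, 1, 1, 1, 2, 2, 2, 2],
    "Sex": ["male", "male", "female", "female",
            "male", "male", "female", "female"],
    "Survived": [1, 0, 1, 1, 0, 0, 1, 0],
})

# Mean of a 0/1 column = share of ones = survival rate per (age class, sex)
rates = df.groupby(["Age_Cats", "Sex"])["Survived"].mean()
print(rates)
```

Each value in rates should correspond to the height of one bar in sns.barplot(x='Age_Cats', y='Survived', hue='Sex', data=df).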
I have a large excel file containing stock data loaded from and sorted from an API. Below is a sample of dummy data that is applicable towards solving this problem:
I would like to create a function that scores stocks grouped by their Industry. For this sample, a stock should score 5 if its ['Growth'] value is below its group's mean Growth (or, in other variants, a quantile or percentile of Growth) and 10 if it is above the group's mean Growth. The scores of all values in a column should be returned in a list.
Current Unfinished Code:
import numpy as np
import pandas as pd
data = pd.read_excel('dummydata.xlsx')  # read_excel already returns a DataFrame
Desired input:
data['Growth'].apply(score) # Scores stock
Desired Output:
[5, 10, 10, 5, 10, 5]
If I can create a function for this sample then I will be able to make similar ones for different columns with slightly different conditions and aggregates (like percentile or quantile) that affect the score. I'd say the main problem here is accessing these grouped values and comparing them.
I don't think it's possible to convert from a Series to a list in the apply call. I may be wrong on that but if the desired output was changed slightly to
data['Growth'].apply(score).tolist()
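then you can use a lambda function to do this. Note that this version compares each value with the mean of the whole Growth column, not with the mean of its Industry group.
score = lambda x: 5 if x < data['Growth'].mean() else 10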
data['Growth'].apply(score).tolist() # outputs [5, 10, 10, 5, 10, 5]
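If the comparison really has to be against each stock's own Industry mean, one possible approach is groupby(...).transform('mean'), which returns the group mean aligned row by row with the original frame. This is only a sketch: the data below is invented, and the column names Industry and Growth are taken from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical dummy data; the real spreadsheet is not shown in the question
data = pd.DataFrame({
    "Industry": ["Tech", "Tech", "Tech", "Energy", "Energy", "Energy"],
    "Growth": [10, 30, 20, 5, 15, 1],
})

# Mean Growth of each row's own industry, aligned with the original rows
group_mean = data.groupby("Industry")["Growth"].transform("mean")

# 5 if strictly below the group mean, otherwise 10
scores = np.where(data["Growth"] < group_mean, 5, 10).tolist()
print(scores)
```

Swapping 'mean' for another aggregate, for example lambda s: s.quantile(0.75), should give the quantile-based variants mentioned in the question.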
I have a map (2D matrix) of observations. Grid boxes without an observed value are assigned a NaN value.
I would like to use the zoom function in python to upscale the size of the grid boxes from 1°x1° to 10°x10°. When I do that, I want to ignore NaN values. For example, in an extreme scenario, if I have 100 1°x1° grid boxes where only one 1°x1° grid box contains an observation and the other 99 1°x1° grid boxes contain NaN, then I want the zoomed out 10°x10° to only take on the value of the single 1°x1° grid box that contained an observation.
Does anyone have a solution for this problem? Let me know if the question is not clear!
Note that it is only in my dataset that the fill value for no observation is NaN. The fill value could be assigned any value. But I want to be able to zoom the grid box sizes with the zooming ignoring the fill values.
Below is an example code that does not do what I want, since the zoom function assigns a NaN value to the 10°x10° grid box if there is one present in the 100 1°x1° it is built from. But I included it as an attempt to illustrate the problem:
import numpy.ma as ma
import numpy as np
from scipy.ndimage import zoom  # the scipy.ndimage.interpolation namespace is deprecated
# Create the fake matrix
A = np.random.uniform(0, 10, 10000)
A[A < 5] = np.nan
A = A.reshape(100,100)
# Print the appearance of A
print('A:')
print(A)
# Zoom out to 10 times larger grid boxes
B = zoom(A,(1/10,1/10),order=1)
# Print the appearance of B
print('B:')
print(B)
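One way to get NaN-ignoring behaviour is to skip zoom altogether and average each 10x10 block with np.nanmean. Note that this swaps interpolation for plain block averaging, so it is a different technique from zoom, and it assumes the grid dimensions are exact multiples of the factor. block_nanmean is a hypothetical helper name, not a SciPy or NumPy function.

```python
import warnings
import numpy as np

def block_nanmean(grid, factor):
    """Coarsen a 2D grid by averaging factor x factor blocks, ignoring NaNs."""
    rows, cols = grid.shape
    if rows % factor or cols % factor:
        raise ValueError("grid dimensions must be multiples of factor")
    # Split each axis into (number of blocks, cells per block)
    blocks = grid.reshape(rows // factor, factor, cols // factor, factor)
    with warnings.catch_warnings():
        # Blocks with no observations at all produce NaN plus a RuntimeWarning
        warnings.simplefilter("ignore", category=RuntimeWarning)
        return np.nanmean(blocks, axis=(1, 3))

# Extreme case from the question: a single observation in one 10x10 block
A = np.full((100, 100), np.nan)
A[3, 7] = 4.2
B = block_nanmean(A, 10)
print(B[0, 0])  # value of the only observed cell in that block
```

If the fill value is something other than NaN, it could be converted first, e.g. np.where(grid == fill_value, np.nan, grid), before calling the helper.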
I am using seaborn version 0.7.1 for Python. I am trying to create a boxplot for the numpy array below:
arr = np.array([2, 4, 5, 5, 8, 8, 9])
From my understanding, the quartiles Q1 and Q3 should be 4 and 8, but in the generated boxplot Q1 is approximately 4.5. What am I missing?
I am using the following command to generate the chart:
sns.boxplot(arr)
It would of course depend on the definition of a quartile.
Wikipedia mentions 3 methods to calculate the quartile,
method1: Take median of the lower part of the sample [2,4,5]. Result 4.
method2: Take median of the lower part of the sample (including its median) [2,4,5,5]. Result 4.5.
method3: The lower quartile is 75% of the second data value plus 25% of the third data value. Result: 4*0.75+5*0.25 = 4.25. (It's always the mean of method1 and method2.)
You may also use numpy to calculate the quartiles
x = [2, 4, 5, 5, 8, 8, 9]
np.percentile(x, [25])
This returns array([4.5])
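The difference between method1 and method2 can be reproduced in a few lines and compared with what np.percentile gives by default. This is a small sketch for this particular array; the variable names are made up for illustration.

```python
import numpy as np

x = np.sort(np.array([2, 4, 5, 5, 8, 8, 9]))
n = len(x)
half = n // 2  # number of values strictly below the median position

# method1: lower half without the median -> [2, 4, 5]
q1_exclusive = np.median(x[:half])

# method2: lower half including the median when n is odd -> [2, 4, 5, 5]
q1_inclusive = np.median(x[:half + n % 2])

# NumPy's default (linear interpolation between neighbouring values)
q1_numpy = np.percentile(x, 25)

print(q1_exclusive, q1_inclusive, q1_numpy)
```

For this sample, NumPy's default agrees with method2, which is consistent with the 4.5 seen in the boxplot.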
I'm trying to calculate the quartiles for an array of values in python using numpy.
X = [1, 1, 1, 3, 4, 5, 5, 7, 8, 9, 10, 1000]
I would do the following:
quartiles = np.percentile(X, range(0, 100, 25))
quartiles
# array([1. , 2.5 , 5. , 8.25])
But this is incorrect, as the 1st and 3rd quartiles should be 2 and 8.5, respectively.
This can be shown as the following:
Q1 = np.median(X[:len(X)//2])  # median of the lower half
Q3 = np.median(X[len(X)//2:])  # median of the upper half
Q1, Q3
# (2.0, 8.5)
I can't get my head around what np.percentile is doing to give a different answer. I'd be very grateful for any light shed on this.
There is no right or wrong here, just different ways of calculating percentiles. The percentile is a well-defined concept in the continuous case, less so for discrete samples: the choice of method makes little difference when the number of observations is very large (compared to the number of duplicates), but it can matter for small samples, and you need to work out case by case which method makes more sense.
To obtain your desired output, you should specify interpolation = 'midpoint' in the percentile function (in NumPy 1.22 and later this keyword is named method instead):
quartiles = np.percentile(X, range(0, 100, 25), interpolation = 'midpoint')
quartiles # array([ 1. , 2. , 5. , 8.5])
I'd suggest having a look at the docs: http://docs.scipy.org/doc/numpy/reference/generated/numpy.percentile.html
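To see why the two rules disagree, here is a hand-written sketch of the linear and midpoint rules as I understand them from the docs. The percentile function below is a hypothetical helper for illustration, not NumPy's implementation.

```python
import math

def percentile(values, q, midpoint=False):
    """q-th percentile using the linear or the midpoint rule."""
    data = sorted(values)
    pos = (len(data) - 1) * q / 100.0   # fractional index into the sorted data
    lo, hi = math.floor(pos), math.ceil(pos)
    if midpoint:
        # average of the two neighbours, regardless of where pos falls
        return (data[lo] + data[hi]) / 2.0
    # linear: weight the neighbours by the fractional part of pos
    return data[lo] + (data[hi] - data[lo]) * (pos - lo)

X = [1, 1, 1, 3, 4, 5, 5, 7, 8, 9, 10, 1000]
linear = [percentile(X, q) for q in range(0, 100, 25)]
mid = [percentile(X, q, midpoint=True) for q in range(0, 100, 25)]
print(linear)
print(mid)
```

For Q1 the fractional index is 2.75, which lies between the values 1 and 3: the linear rule gives 1 + 0.75*2 = 2.5, while the midpoint rule gives (1+3)/2 = 2.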
In Excel 2010 is it possible to have X and Y categories in a scatter/line graph?
An example would be Simple, Intermediate, Complex on the X axis and Low, Medium, High on the Y axis and three markers in the plot area corresponding to Simple/Low, Intermediate/Medium and Complex/High.
Thanks.
You need some numbers in your dataset. In my opinion, a scatter between two categories can't be plotted, and it wouldn't be of much use anyway.
You can put your categories on one axis (say X) and numeric values on the other axis (say Y), and then plot the graph.
Your categories should be unique for scatter charts; if a single category appears more than once, Excel will automatically replace the categories with the numbers 1, 2, 3, ....