How to calculate the percentile value of a number in a dataframe column grouped by index - python-3.x

I have a dataframe like this:
df:
Score
group
A 100
A 34
A 40
A 30
C 24
C 60
C 35
For every group in the data, I want to find out the percentile value of Score 35.
(i.e., the percentile at which 35 falls within the grouped data)
I tried different tricks but none of them worked.
scipy.stats.percentileofscore(df['Score'], 35, kind='weak')
--> This works, but it doesn't give me the percentile grouped by index
df.groupby('group')['Score'].percentileofscore()
--> 'SeriesGroupBy' object has no attribute 'percentileofscore'
scipy.stats.percentileofscore(df.groupby('group')[['Score']], 35, kind='strict')
--> TypeError: '<' not supported between instances of 'str' and 'int'
My ideal output looks like this:
df:
Percentile
group
A 50
C 33
Can anyone suggest what would work here?

The inverse quantile function for a sequence at a point X is the proportion of values less than X in the sequence, right? So:
In [158]: df["Score"].lt(35).groupby(df["group"]).mean().mul(100)
Out[158]:
group
A 50.000000
C 33.333333
Name: Score, dtype: float64
- get a True/False Series of whether < 35 or not on "Score"
- group this Series over "group"
- take the mean
- since True == 1 and False == 0, the mean effectively gives the proportion
- multiply by 100 to get percentages
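Putting it together, here's a minimal self-contained sketch (the dataframe construction mirrors the sample above; the `score` variable name is my addition, not from the answer):

import pandas as pd

df = pd.DataFrame({
    "group": ["A", "A", "A", "A", "C", "C", "C"],
    "Score": [100, 34, 40, 30, 24, 60, 35],
})

score = 35  # the value whose within-group percentile we want
pct = df["Score"].lt(score).groupby(df["group"]).mean().mul(100)
print(pct)
# group
# A    50.000000
# C    33.333333
# Name: Score, dtype: float64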

To answer in a more general-purpose way: you're looking to do a custom aggregation on the groups, which pandas lets you do with the agg method.
You can define the function yourself or use one from a library:
def percentileofscore(ser: pd.Series) -> float:
    # proportion of scores strictly less than 35, as a percentage
    return 100 * (ser < 35).sum() / ser.size
df.groupby("group").agg(percentileofscore)
Output:
Score
group
A 50.000000
C 33.333333
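If you'd rather not hardcode 35, one option is a small factory that builds the aggregator from the target value (a sketch; the make_percentileofscore name is mine, not from the answer):

import pandas as pd

def make_percentileofscore(value):
    # returns an aggregator computing the 'strict' percentile of `value`
    def percentileofscore(ser: pd.Series) -> float:
        return 100 * (ser < value).sum() / ser.size
    return percentileofscore

df.groupby("group").agg(make_percentileofscore(35))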

Related

How to change Pandas Column Values in List Format

I'm trying to multiply each value in a column by 0.01 but the column values are in list format. How do I apply it to each element of the list in each row? For example, my data looks like this:
ID Amount
156 [14587, 38581, 55669]
798 [67178, 98635]
And I'm trying to multiply each element in the lists by 0.01.
ID Amount
156 [145.87, 385.81, 556.69]
798 [671.78, 986.35]
I've tried the following code but got an error message saying "can't multiply sequence by non-int of type 'float'".
df['Amount'] = df3['Amount'].apply(lambda x: x*0.00000001 in x)
You need another loop / list comprehension in apply:
df['Amount'] = df.Amount.apply(lambda lst: [x * 0.01 for x in lst])
df
ID Amount
0 156 [145.87, 385.81, 556.69]
1 798 [671.78, 986.35]
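For very large frames, a vectorized alternative is to flatten the lists, scale, and regroup by the original index (a sketch, assuming pandas >= 0.25 for Series.explode; df is the frame from the question):

df['Amount'] = (df['Amount'].explode().mul(0.01)
                            .groupby(level=0).agg(list))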

calculate sumproduct(Excel) in pandas dataframe based on condition

I have a dataframe. Structure:
SEQ product_name prod_cost non-prd_cost mgmt grand_total
1 prod1 100 200 20 320
2 prod2 200 400 30 630
3 prod3 300 500 40 840
4 prod4 100 300 50 450
I want to calculate a SUMPRODUCT (as in Excel) based on a condition. The condition is based on product_name.
Let's say I want to calculate a variable called
sumprod_prod1_prd_prod3_mgmt = SUMPRODUCT(SEQ 1-4,product_name='prod1'_prod_cost and 'prod3'_mgmt)/2 = 100+40=140
How can I do this in pandas?
I am a bit confused by your question, since the Excel SUMPRODUCT function returns the sum of the products of corresponding ranges or arrays, while you seem to want the sum of a single combination of values.
To get the desired value:
sumprod_prod1_prd_prod3_mgmt = (
    df[df['product_name'] == 'prod1']['prod_cost'].values[0]
    + df[df['product_name'] == 'prod3']['mgmt'].values[0]
)
This solution gives a single result for the specified values. If you need a solution which provides the same functionality as excel, please update your question and example to better define what you are looking for.
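If you later need to combine more than two terms, a small generalization is possible (a sketch; the pairs list and variable names are mine):

pairs = [('prod1', 'prod_cost'), ('prod3', 'mgmt')]
total = sum(df.loc[df['product_name'] == name, col].iloc[0]
            for name, col in pairs)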

How to find the correlation between two categorical variables: num_chicken_pox and how many times the vaccine was given

The problem is how to find the correlation between two categorical series.
The situation is that I have to find the correlation between HAVING_CPOX and NUM_VECILLA_veccine given among children.
The main catch is that the HAVING_CPOX column has 4 unique values:
1 - having cpox
2 - not having cpox
99 - may be NULL
7 - I don't know
In df['P_NUMVRC'] the unique values are [1, 2, 3, 0, NaN].
These are two distinct series, so how do I put them together and find the correlation?
I used value_counts to get the frequency of each:
1 13781
2 213
3 1
Name: P_NUMVRC, dtype: int64
For the HAD_CPOX column:
2 27955
1 402
77 105
99 3
Name: HAD_CPOX, dtype: int64
The requirement is like this:
A positive correlation (e.g., corr > 0) means that an increase in had_chickenpox_column (which means more no's) would also increase the values of num_chickenpox_vaccine_column (which means more doses of vaccine). If there is a negative correlation (e.g., corr < 0), it indicates that having had chickenpox is related to an increase in the number of vaccine doses.
I think what you are looking for is using np.corrcoef. It receives two (in your case - 1 dimensional) arrays, and returns the Pearson Correlation (for more details see: https://numpy.org/doc/stable/reference/generated/numpy.corrcoef.html).
So basically:
valid_df = df.query('HAVING_CPOX < 3').dropna(subset=['P_NUMVRC'])  # drop NaN so corrcoef is defined
# Series.apply has no inplace argument; build a 0/1 indicator instead
valid_df['HAVING_CPOX'] = valid_df['HAVING_CPOX'].eq(1).astype(int)
corr = np.corrcoef(valid_df['HAVING_CPOX'], valid_df['P_NUMVRC'])[0, 1]  # off-diagonal of the 2x2 matrix
What I did is first get rid of the 99's and 7's since you can't really rely on those. Then I changed the HAVING_CPOX to be binary (0 is "has no cpox" and 1 is "has cpox"), so that the correlation makes sense. Then I used corrcoef from numpy's implementation.
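Equivalently, pandas can compute the Pearson correlation directly, and it skips NaN pairs on its own (a sketch, reusing valid_df from above):

corr = valid_df['HAVING_CPOX'].corr(valid_df['P_NUMVRC'])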

How to assign values from a list to a pandas dataframe and control the distribution/frequency each list element has in the dataframe

I am building a dataframe and need to assign values from a defined list to a new column in the dataframe. I have found an answer which gives a method to assign elements from a list randomly to a new column in a dataframe here (How to assign random values from a list to a column in a pandas dataframe?).
But I want to be able to control the distribution of the elements in my list within the new dataframe by either assigning a frequency of occurrences or some other method to control how many times each list element appears in the dataframe.
For example, if I have a list my_list = [50, 40, 30, 20, 10] how can I say that for a dataframe (df) with n number of rows assign 50 to 10% of the rows, 40 to 20%, 30 to 30%, 20 to 35% and 10 to 5% of the rows.
Any other method to control for the distribution of list elements is welcome, the above is a simple explanation to illustrate how one way to be able to control frequency may look.
You can use the choice function from numpy.random, providing a probability distribution:
>>> a = np.random.choice([50, 40, 30, 20, 10], size=100, p=[0.1, 0.2, 0.3, 0.35, 0.05])
>>> pd.Series(a).value_counts().sort_index(ascending=False)
50 9
40 25
30 19
20 38
10 9
dtype: int64
Just pass the desired size (the dataframe's length) as the size parameter.
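Applied to the original question, a minimal sketch (my_list and the percentages come from the example above; df and the 'new_col' name are assumptions):

import numpy as np

my_list = [50, 40, 30, 20, 10]
probs = [0.10, 0.20, 0.30, 0.35, 0.05]  # must sum to 1
df['new_col'] = np.random.choice(my_list, size=len(df), p=probs)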

Resampling on non-time related buckets

Say I have a df looking like this:
price quantity
0 100 20
1 102 31
2 105 25
3 99 40
4 104 10
5 103 20
6 101 55
There are no time intervals here. I need to calculate a Volume Weighted Average Price for every 50 items in quantity. Every row (index) in the output would represent 50 units (as opposed to say 5-min intervals), the output column would be the volume weighted price.
Any neat way to do this using pandas, or numpy for that matter? I tried using a loop, splitting every row into single-unit prices and then grouping them like this:
import itertools

def grouper(n, iterable):
    it = iter(iterable)
    while True:
        chunk = tuple(itertools.islice(it, n))
        if not chunk:
            return
        yield chunk
But it takes forever and I run out of memory. The df is a few million rows.
EDIT:
The output I want to see based on the above is:
vwap
0 101.20
1 102.12
2 103.36
3 101.00
Each 50 items gets a new average price.
I struck out on my first at-bat facing this problem. Here's my next plate appearance. Hopefully I can put the ball in play and score a run.
First, let's address some of the comments related to the expected outcome of this effort. The OP posted what he thought the results should be using the small sample data he provided. However, @user7138814 and I both came up with the same outcome that differed from the OP's. Let me explain how I believe the weighted average of exactly 50 units should be calculated using the OP's example. I'll use this worksheet as an illustration.
The first 2 columns (A and B) are the original values given by the OP. Given those values, the goal is to calculate a weighted average for each block of exactly 50 units. Unfortunately, the quantities are not evenly divisible by 50. Columns C and D represent how to create even blocks of 50 units by subdividing the original quantities as needed. The yellow shaded areas show how the original quantity was subdivided, and each of the green bounded cells sums to exactly 50 units. Once 50 units are determined, the weighted average can be calculated in column E. As you can see, the values in E match what @user7138814 posted in his comment, so I think we agree on the methodology.
After much trial and error, the final solution is a function that operates on the numpy arrays underlying the price and quantity series. The function is further optimized with Numba's jit decorator, which compiles the Python code to machine code. On my laptop, it processed 3-million-row arrays in a fraction of a second.
Here's the function.
import numba
import numpy as np

@numba.jit
def vwap50_jit(price_col, quantity_col):
    n_rows = len(price_col)
    assert len(price_col) == len(quantity_col)
    qty_cumdif = 50  # cum difference of quantity to track when 50 units are reached
    pq = 0.0         # cumsum of price * quantity
    vwap50 = []      # list of weighted averages
    for i in range(n_rows):
        price, qty = price_col[i], quantity_col[i]
        # if current qty will cause more than 50 units,
        # divide the units
        if qty_cumdif < qty:
            pq += qty_cumdif * price
            # at this point, 50 units accumulated. calculate average.
            vwap50.append(pq / 50)
            qty -= qty_cumdif
            # continue dividing
            while qty >= 50:
                qty -= 50
                vwap50.append(price)
            # remaining qty and pq become starting
            # values for next group of 50
            qty_cumdif = 50 - qty
            pq = qty * price
        # process price, qty pair as-is
        else:
            qty_cumdif -= qty
            pq += qty * price
    return np.array(vwap50)
Results of processing the OP's sample data.
Out[6]:
price quantity
0 100 20
1 102 31
2 105 25
3 99 40
4 104 10
5 103 20
6 101 55
vwap50_jit(df.price.values, df.quantity.values)
Out[7]: array([101.2 , 102.06, 101.76, 101. ])
Notice that I use the .values attribute to pass the underlying numpy arrays of the pandas series. That's one of the requirements of using numba: it is numpy-aware and doesn't work on pandas objects.
It performs pretty well on 3 million row arrays, creating an output array of 2.25 million weighted averages.
df = pd.DataFrame({'price': np.random.randint(95, 150, 3000000),
                   'quantity': np.random.randint(1, 75, 3000000)})
%timeit vwap50_jit(df.price.values, df.quantity.values)
154 ms ± 4.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
vwap = vwap50_jit(df.price.values, df.quantity.values)
vwap.shape
Out[11]: (2250037,)
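If you want the result shaped like the OP's expected output, a quick wrap-up (a sketch, reusing vwap50_jit and df from above; pandas is assumed to be imported as pd):

vwap_df = pd.DataFrame({'vwap': vwap50_jit(df.price.values, df.quantity.values)})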
