Say I have a df looking like this:
price quantity
0 100 20
1 102 31
2 105 25
3 99 40
4 104 10
5 103 20
6 101 55
There are no time intervals here. I need to calculate a Volume Weighted Average Price for every 50 items in quantity. Every row (index) in the output would represent 50 units (as opposed to, say, 5-minute intervals), and the output column would be the volume-weighted price.
Any neat way to do this using pandas, or numpy for that matter? I tried using a loop, splitting every row into single-unit prices and then grouping them like this:
import itertools

def grouper(n, iterable):
    it = iter(iterable)
    while True:
        chunk = tuple(itertools.islice(it, n))
        if not chunk:
            return
        yield chunk
But it takes forever and I run out of memory. The df is a few million rows.
EDIT:
The output I want to see based on the above is:
vwap
0 101.20
1 102.12
2 103.36
3 101.00
Each 50 items gets a new average price.
I struck out on my first at-bat facing this problem. Here's my next plate appearance. Hopefully I can put the ball in play and score a run.
First, let's address some of the comments related to the expected outcome of this effort. The OP posted what he thought the results should be using the small sample data he provided. However, @user7138814 and I both came up with the same outcome, which differs from the OP's. Let me explain how I believe the weighted average of exactly 50 units should be calculated using the OP's example. I'll use this worksheet as an illustration.
The first 2 columns (A and B) are the original values given by the OP. Given those values, the goal is to calculate a weighted average for each block of exactly 50 units. Unfortunately, the quantities are not evenly divisible by 50. Columns C and D show how to create even blocks of 50 units by subdividing the original quantities as needed. The yellow shaded areas show how each original quantity was subdivided, and each of the green-bordered groups of cells sums to exactly 50 units. Once 50 units are accumulated, the weighted average can be calculated in column E. As you can see, the values in E match what @user7138814 posted in his comment, so I think we agree on the methodology.
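To make the arithmetic concrete, here are the first two blocks from the sample worked out by hand (plain Python, just restating the worksheet calculation):
# Block 1: 20 units @ 100 plus 30 of the second row's 31 units @ 102
(20 * 100 + 30 * 102) / 50            # -> 101.2
# Block 2: the leftover 1 unit @ 102, all 25 units @ 105, and 24 of the 40 units @ 99
(1 * 102 + 25 * 105 + 24 * 99) / 50   # -> 102.06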
After much trial and error, the final solution is a function that operates on the numpy arrays underlying the price and quantity series. The function is further optimized with the Numba jit decorator, which compiles the Python code to machine code. On my laptop, it processed 3-million-row arrays in well under a second.
Here's the function.
import numpy as np
import numba

@numba.jit
def vwap50_jit(price_col, quantity_col):
    n_rows = len(price_col)
    assert len(price_col) == len(quantity_col)
    qty_cumdif = 50   # quantity still needed to complete the current block of 50 units
    pq = 0.0          # running sum of price * quantity for the current block
    vwap50 = []       # list of weighted averages
    for i in range(n_rows):
        price, qty = price_col[i], quantity_col[i]
        # if the current qty would push the block past 50 units,
        # divide the units between blocks
        if qty_cumdif < qty:
            pq += qty_cumdif * price
            # at this point, 50 units accumulated. calculate average.
            vwap50.append(pq / 50)
            qty -= qty_cumdif
            # continue dividing while full blocks of 50 remain
            while qty >= 50:
                qty -= 50
                vwap50.append(price)
            # remaining qty and pq become starting
            # values for the next group of 50
            qty_cumdif = 50 - qty
            pq = qty * price
        # otherwise process the price, qty pair as-is
        else:
            qty_cumdif -= qty
            pq += qty * price
    return np.array(vwap50)
Results of processing the OP's sample data.
Out[6]:
price quantity
0 100 20
1 102 31
2 105 25
3 99 40
4 104 10
5 103 20
6 101 55
vwap50_jit(df.price.values, df.quantity.values)
Out[7]: array([101.2 , 102.06, 101.76, 101. ])
Notice that I use the .values attribute to pass the underlying numpy arrays of the pandas Series. That's one of the requirements of using numba: numba is numpy-aware and doesn't work on pandas objects.
It performs pretty well on 3-million-row arrays, producing an output array of 2.25 million weighted averages.
df = pd.DataFrame({'price': np.random.randint(95, 150, 3000000),
'quantity': np.random.randint(1, 75, 3000000)})
%timeit vwap50_jit(df.price.values, df.quantity.values)
154 ms ± 4.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
vwap = vwap50_jit(df.price.values, df.quantity.values)
vwap.shape
Out[11]: (2250037,)
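If you want the one-column frame shown in the question, the returned array can simply be wrapped; a small sketch (the result name is just for illustration, df and vwap50_jit are defined above):
result = pd.DataFrame({'vwap': vwap50_jit(df.price.values, df.quantity.values)})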
Related
I have a dataframe like this:
df:
Score
group
A 100
A 34
A 40
A 30
C 24
C 60
C 35
For every group in the data, I want to find out the percentile of the Score value 35
(i.e., the percentile at which 35 falls within that group's data).
I tried different tricks but none of them worked.
scipy.stats.percentileofscore(df['Score'], 35, kind='weak')
--> This works, but it doesn't give me the percentile grouped by index
df.groupby('group')['Score'].percentileofscore()
--> 'SeriesGroupBy' object has no attribute 'percentileofscore'
scipy.stats.percentileofscore(df.groupby('group')[['Score']], 35, kind='strict')
--> TypeError: '<' not supported between instances of 'str' and 'int'
My ideal output looks like this:
df:
Score Percentile
group
A 50
C 33
Can anyone suggest to me what works well here?
Inverse quantile function for a sequence at point X is the proportion of values less than X in the sequence, right? So:
In [158]: df["Score"].lt(35).groupby(df["group"]).mean().mul(100)
Out[158]:
group
A 50.000000
C 33.333333
Name: Score, dtype: float64
- get a True/False Series of whether "Score" is < 35 or not
- group this Series over "group"
- take the mean
- since True == 1 and False == 0, this effectively gives the proportion
- multiply by 100 to get percentages
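For reference, a minimal reproduction of the sample data and the one-liner above, assuming group is a regular column as the code expects:
import pandas as pd

df = pd.DataFrame({
    "group": ["A", "A", "A", "A", "C", "C", "C"],
    "Score": [100, 34, 40, 30, 24, 60, 35],
})

# proportion of scores strictly below 35 within each group, as a percentage
df["Score"].lt(35).groupby(df["group"]).mean().mul(100)
# group
# A    50.000000
# C    33.333333
# Name: Score, dtype: float64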
To answer in a more general-purpose way: you're looking to do a custom aggregation on each group, which pandas lets you do with the agg method.
You can define the function yourself or use one from a library:
import pandas as pd

def percentileofscore(ser: pd.Series) -> float:
    return 100 * (ser > 35).sum() / ser.size

df.groupby("group").agg(percentileofscore)
Output:
Score
group
A 50.000000
C 33.333333
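If you would rather keep scipy's percentileofscore, the same agg mechanism accepts a lambda; a sketch that reproduces the numbers above for this sample (kind='strict' counts the values strictly below 35):
from scipy.stats import percentileofscore

df.groupby("group")["Score"].agg(lambda s: percentileofscore(s, 35, kind="strict"))
# group
# A    50.000000
# C    33.333333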
I'm trying to find a way to predict high Results based on statistics data; that's why I need to find the weights of the parameters.
I have a data sample with the following structure:
Price A  Price B  Price C  Result
5        4        9        80
2        3        0        30
On that structure I would like to calculate weights for Price A, Price B and Price C that predict the highest values of the Result column.
Judging from the sample above, Price C seems to carry the most weight, so the weight for Price C should be the highest one.
So for the data, weightA * priceA + weightB * priceB + weightC * priceC should produce a Result of around 80 for the first row, and around 30 for the second row.
I've tried something with the Pearson correlation (in Excel) but I didn't get good results.
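For what it's worth, an ordinary least-squares fit is a common way to estimate such weights; a minimal numpy sketch using only the two sample rows above (with two rows and three unknowns the system is underdetermined, so a meaningful fit needs more data):
import numpy as np

# one row per observation: Price A, Price B, Price C
X = np.array([[5, 4, 9],
              [2, 3, 0]], dtype=float)
y = np.array([80, 30], dtype=float)

# least-squares solution of X @ w ~ y (no intercept term in this sketch)
w, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(w)        # estimated weights for Price A, Price B, Price C
print(X @ w)    # predictions; roughly [80, 30] on this toy sample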
I have a dataframe. Structure:
SEQ product_name prod_cost non-prd_cost mgmt grand_total
1 prod1 100 200 20 320
2 prod2 200 400 30 630
3 prod3 300 500 40 840
4 prod4 100 300 50 450
I want to calculate a SUMPRODUCT (as in Excel) based on a condition. The condition is based on product_name.
Let's say I want to calculate a variable called
sumprod_prod1_prd_prod3_mgmt = SUMPRODUCT(SEQ 1-4, product_name='prod1'_prod_cost and 'prod3'_mgmt)/2 = 100 + 40 = 140
How can I do this in pandas?
I am a bit confused by your question, since the Excel SUMPRODUCT function returns the sum of the products of corresponding ranges or arrays, while you seem to want the sum of a single combination of values.
To get the desired value:
sumprod_prod1_prd_prod3_mgmt = df[df['product_name'] == 'prod1']['prod_cost'].values[0] + df[df['product_name'] == 'prod3']['mgmt'].values[0]
This solution gives a single result for the specified values. If you need a solution which provides the same functionality as excel, please update your question and example to better define what you are looking for.
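For reference, the general Excel SUMPRODUCT translates to an elementwise product followed by a sum in pandas; a sketch using the column names from the question (which columns and rows to combine is an assumption here):
# Excel-style SUMPRODUCT over two whole columns
(df['prod_cost'] * df['mgmt']).sum()

# conditional variant: restrict the rows with a boolean mask first
mask = df['product_name'].isin(['prod1', 'prod3'])
(df.loc[mask, 'prod_cost'] * df.loc[mask, 'mgmt']).sum()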
This question already has an answer here:
Normalize rows of pandas data frame by their sums [duplicate]
(1 answer)
Closed 2 years ago.
I have very high-dimensional data with more than 100 columns. As an example, I am sharing a simplified version of it below:
date product price amount
11/17/2019 A 10 20
11/24/2019 A 10 20
12/22/2020 A 20 30
15/12/2019 C 40 50
02/12/2020 C 40 50
I am trying to calculate each column's percentage of the total row sum, as illustrated below:
date product price amount
11/17/2019 A 10/(10+20) 20/(10+20)
11/24/2019 A 10/(10+20) 20/(10+20)
12/22/2020 A 20/(20+30) 30/(20+30)
15/12/2019 C 40/(40+50) 50/(40+50)
02/12/2020 C 40/(40+50) 50/(40+50)
Is there any way to do this efficiently for high dimensional data? Thank you.
In addition to the provided link (Normalize rows of pandas data frame by their sums), you need to select the specific columns, as your first two columns are non-numeric:
cols = df.columns[2:]
df[cols] = df[cols].div(df[cols].sum(axis=1), axis=0)
Out[1]:
date product price amount
0 11/17/2019 A 0.3333333333333333 0.6666666666666666
1 11/24/2019 A 0.3333333333333333 0.6666666666666666
2 12/22/2020 A 0.4 0.6
3 15/12/2019 C 0.4444444444444444 0.5555555555555556
4 02/12/2020 C 0.4444444444444444 0.5555555555555556
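If the non-numeric columns are not guaranteed to be the first two, selecting by dtype is a common alternative; same normalization, just a different way of picking the columns:
# pick every numeric column regardless of its position
num_cols = df.select_dtypes(include='number').columns
df[num_cols] = df[num_cols].div(df[num_cols].sum(axis=1), axis=0)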
I'm trying to take the mean of n numbers in a pandas DataFrame column and "drag" the formula down each row to get the respective mean.
Let's say there are 6 rows of data with "Numbers" in column A and "Averages" in column B. I want to take the average of A1:A2, then "drag" that formula down to get the average of A2:A3, A3:A4, etc.
import statistics
import pandas as pd

list = [55,6,77,75,9,127,13]
finallist = pd.DataFrame(list)
finallist.columns = ['Numbers']
Below gives me the average of rows 0:2 in the Numbers column. So calling out the rows with .iloc[0:2] works, but when I try to shift down a row it doesn't work:
finallist['Average'] = statistics.mean(finallist['Numbers'].iloc[0:2])
Below I'm trying to take the average of the first two rows, then shift down by 1 as you move down the rows, but I get a value of NaN:
finallist['Average'] = statistics.mean(finallist['Numbers'].iloc[0:2].shift(1))
I expected .iloc[0:2].shift(1) to shift the mean calculation down one row while still covering two rows in total, but I got a value of NaN.
Here's a screenshot of my output:
What's happening in your shift(1) approach is that you're actually shifting the values in your data "down" by one row, so this code:
df['Numbers'].iloc[0:2].shift(1)
Produces the output:
0 NaN
1 55.0
Then you take the average of these two, which evaluates to NaN, and you assign that single value to every element of the Averages Series here:
df['Averages'] = statistics.mean(df['Numbers'].iloc[0:2].shift(1))
You can instead use rolling() combined with mean() to get a sliding average across the entire data frame like this:
import pandas as pd
values = [55,6,77,75,9,127,13]
df = pd.DataFrame(values)
df.columns = ['Numbers']
df['Averages'] = df.rolling(2, min_periods=1).mean()
This produces the following output:
Numbers Averages
0 55 55.0
1 6 30.5
2 77 41.5
3 75 76.0
4 9 42.0
5 127 68.0
6 13 70.0
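Note that min_periods=1 is what makes row 0 produce 55.0 from a one-element window; if you would rather leave the first row empty until a full two-value window is available, dropping min_periods gives NaN there instead (a small variation on the code above):
# without min_periods the first, incomplete window yields NaN
df['Averages'] = df['Numbers'].rolling(2).mean()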