I have a transaction dataframe as below:
Item Date Code Qty Price Value
0 A 01-01-01 Buy 10 100.5 1005.0
1 A 02-01-01 Buy 5 120.0 600.0
2 A 03-01-01 Sell 12 125.0 1500.0
3 A 04-01-01 Buy 9 110.0 990.0
4 A 04-01-01 Sell 1 100.0 100.0
#and so on... there are a million rows with about thousand items (here just one item A)
What I want is to map each sell transaction against the purchase transactions sequentially, FIRST IN FIRST OUT: the purchase that was made first is sold off first.
For this, I have added a new column bQty whose opening balance equals the purchase quantity. Then, for each sell transaction, I iterate through the dataframe to offset the sold quantity against the purchase transactions on or before that date.
df['bQty'] = df[df['Code'] == 'Buy']['Qty']
for _, sell in df[df['Code'] == 'Sell'].iterrows():
    sellDate = sell['Date']
    for idx, buy in df[(df['Code'] == 'Buy') & (df['Date'] <= sellDate)].iterrows():
        # ...offset the sold quantity against this purchase...
Now this requires me to go through the whole dataframe again and again for each sell transaction.
For 1000 records it takes about 10 seconds to complete, so for a million records this approach would take far too long.
Is there any faster way to do this?
If you are only interested in the resulting final balance values per item, here is a fast way to calculate them:
Add two additional columns that contain the same absolute values as Qty and Value, but with a negative sign in the rows where Code is Sell. Then you can group by item and sum these columns to get, for each item, the remaining quantity and the net money spent on it.
sale = df.Code == 'Sell'
df['Qty_signed'] = df.Qty.copy()
df.loc[sale, 'Qty_signed'] *= -1
df['Value_signed'] = df.Value.copy()
df.loc[sale, 'Value_signed'] *= -1
qty_remaining = df.groupby('Item')['Qty_signed'].sum()
print(qty_remaining)
money_spent = df.groupby('Item')['Value_signed'].sum()
print(money_spent)
Output:
Item
A 11
Name: Qty_signed, dtype: int64
Item
A 995.0
Name: Value_signed, dtype: float64
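If you want both balances at once, the two groupby calls can also be combined into a single one over the same signed columns:
balance = df.groupby('Item')[['Qty_signed', 'Value_signed']].sum()
print(balance)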
Forgive me if this is a repeat question, but I can't find the answer and I'm not even sure what the right terminology is.
I have two dataframes that don't have completely matching rows or columns. Something like:
Balances = pd.DataFrame({'Name':['Alan','Barry','Carl', 'Debbie', 'Elaine'],
'Age Of Debt':[1,4,3,7,2],
'Balance':[500,5000,300,100,3000],
'Payment Due Date':[1,1,30,14,1]})
Payments = pd.DataFrame({'Name':['Debbie','Alan','Carl'],
'Balance':[50,100,30]})
I want to subtract the Payments dataframe from the Balances dataframe based on Name, so essentially a new dataframe that looks like this:
pd.DataFrame({'Name':['Alan','Barry','Carl', 'Debbie', 'Elaine'],
'Age Of Debt':[1,4,3,7,2],
'Balance':[400,5000,270,50,3000],
'Payment Due Date':[1,1,30,14,1]})
I can imagine having to iterate over the rows of Balances, but when both dataframes are very large I don't think it's very efficient.
You can use .merge:
tmp = pd.merge(Balances, Payments, on="Name", how="outer").fillna(0)
Balances["Balance"] = tmp["Balance_x"] - tmp["Balance_y"]
print(Balances)
Prints:
Name Age Of Debt Balance Payment Due Date
0 Alan 1 400.0 1
1 Barry 4 5000.0 1
2 Carl 3 270.0 30
3 Debbie 7 50.0 14
4 Elaine 2 3000.0 1
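As an alternative sketch that does not depend on the merged rows lining up with the original index, you could also build a Name-to-payment mapping and subtract it directly (names without a payment are treated as paying 0):
paid = Payments.set_index('Name')['Balance']
Balances['Balance'] = Balances['Balance'] - Balances['Name'].map(paid).fillna(0)
print(Balances)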
Suppose I have this (random) df_bnb:
neighbourhood room_type price minimum_nights
0 Allen Pvt room 38 5
1 Arder Entire home/apt 90 2
2 Arrochar Entire home/apt 90 2
3 Belmont Shared Room 15 1
4 City Island Entire home/apt 100 3
Every row represents an Airbnb booking.
I want to generate a pivot_table whose index is the neighbourhood column and whose columns are the other dataframe columns ['room_type', 'price', 'minimum_nights'].
I want the entries of those columns aggregated as the mean, except for room_type, where I want the mode, like in the following example dataframe:
room_type price minimum_nights
Allen room type mode for Allen price mean for Allen mean min nights for Allen
Arder room type mode for Arder price mean for Arder mean min nights for Arder
Arrochar room type mode for Arrochar price mean for Arrochar mean of min nights for Arrochar
Belmont room type mode for Belmont price mean for Belmont mean of min nights for Belmont
City Island room type mode for City Island price mean for City Is. mean of min nights for City Island
This is the code I have tried so far:
bnb_pivot = pd.pivot_table(bnb,
                           index=['neighbourhood'],
                           values=['room_type', 'price',
                                   'minimum_nights', 'number_of_reviews'],
                           aggfunc={'room_type': statistics.mode,
                                    'price': np.mean,
                                    'minimum_nights': np.mean,
                                    'number_of_reviews': np.mean})
This is the error that I am getting:
StatisticsError: no unique mode; found 2 equally common values
I have searched other sources, but I don't know how to handle statistics.mode() while creating a pivot_table.
Many thanks in advance for any helpful indication!
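One possible workaround, sketched here with groupby/agg as the equivalent of the pivot (assuming the dataframe and column names from the sample, i.e. bnb with a neighbourhood column): statistics.mode raises StatisticsError when two values are tied, whereas pandas' own Series.mode returns all tied values, so keeping the first one avoids the error.
first_mode = lambda s: s.mode().iat[0]  # first of possibly several tied modes
bnb_pivot = bnb.groupby('neighbourhood').agg(
    room_type=('room_type', first_mode),
    price=('price', 'mean'),
    minimum_nights=('minimum_nights', 'mean'))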
My data set contains house prices for 4 different house types (A, B, C, D) in 4 different countries (USA, Germany, UK, Sweden). A house price can only move in three ways (Upward, Downward, and Not Changed). I want to calculate a Diffusion Index (DI) for each house type (A, B, C, D) in each country (USA, Germany, UK, Sweden) based on house price.
The formula I want to use to calculate the Diffusion Index (DI) is:
DI = (Total Number of Upward * 1 + Total Number of Downward * 0 + Total Number of Not Changed * 0.5) / (Total Number of Upward + Total Number of Downward + Total Number of Not Changed)
Here is my data:
and the expected result is:
I really need your help.
Thanks.
You can do this by using groupby. Assuming your file is named test.xlsx:
df = pd.read_excel('test.xlsx')
df = df.replace({'Upward':1,'Downward':0,'Notchanged':0.5})
df.groupby('Country').mean().reset_index()
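To see why the plain mean reproduces the DI formula: after the replacement, each Upward contributes 1, each Not Changed 0.5 and each Downward 0, so the column mean is exactly (1 * N_upward + 0.5 * N_not_changed + 0 * N_downward) / N_total. A tiny illustration with made-up data (the real column layout is only visible in the posted images, so the names here are hypothetical):
import pandas as pd
df = pd.DataFrame({'Country': ['USA'] * 4,
                   'A': ['Upward', 'Upward', 'Notchanged', 'Downward']})
df = df.replace({'Upward': 1, 'Downward': 0, 'Notchanged': 0.5})
print(df.groupby('Country').mean().reset_index())
# DI for house type A in the USA: (2*1 + 1*0.5 + 1*0) / 4 = 0.625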
Say I have a df looking like this:
price quantity
0 100 20
1 102 31
2 105 25
3 99 40
4 104 10
5 103 20
6 101 55
There are no time intervals here. I need to calculate a Volume Weighted Average Price for every 50 units of quantity. Every row (index) in the output would represent 50 units (as opposed to, say, 5-minute intervals), and the output column would be the volume weighted price.
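For example, the first output row would cover the first 50 units: all 20 units at 100 plus 30 of the 31 units at 102, giving (20 * 100 + 30 * 102) / 50 = 101.2.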
Any neat way to do this using pandas, or numpy for that matter? I tried using a loop that splits every row into single-unit prices and then groups them like this:
import itertools

def grouper(n, iterable):
    it = iter(iterable)
    while True:
        chunk = tuple(itertools.islice(it, n))
        if not chunk:
            return
        yield chunk
But it takes forever and I run out of memory. The df is a few million rows.
EDIT:
The output I want to see based on the above is:
vwap
0 101.20
1 102.12
2 103.36
3 101.00
Each 50 items gets a new average price.
I struck out on my first at-bat facing this problem. Here's my next plate appearance. Hopefully I can put the ball in play and score a run.
First, let's address some of the comments related to the expected outcome of this effort. The OP posted what he thought the results should be using the small sample data he provided. However, @user7138814 and I both came up with the same outcome, which differed from the OP's. Let me explain how I believe the weighted average of exactly 50 units should be calculated using the OP's example. I'll use this worksheet as an illustration.
The first 2 columns (A and B) are the original values given by the OP. Given those values, the goal is to calculate a weighted average for each block of exactly 50 units. Unfortunately, the quantities are not evenly divisible by 50. Columns C and D represent how to create even blocks of 50 units by subdividing the original quantities as needed. The yellow shaded areas show how the original quantity was subdivided, and each of the green bounded cells sums to exactly 50 units. Once 50 units are determined, the weighted average can be calculated in column E. As you can see, the values in E match what @user7138814 posted in his comment, so I think we agree on the methodology.
After much trial and error, the final solution is a function that operates on the numpy arrays of the underlying price and quantity series. The function is further optimized with a Numba decorator that jit-compiles the Python code into machine-level code. On my laptop, it processed 3-million-row arrays in under a second.
Here's the function.
import numpy as np
import numba

@numba.jit
def vwap50_jit(price_col, quantity_col):
    n_rows = len(price_col)
    assert len(price_col) == len(quantity_col)
    qty_cumdif = 50   # cumulative difference of quantity, tracks when 50 units are reached
    pq = 0.0          # cumulative sum of price * quantity for the current block
    vwap50 = []       # list of weighted averages
    for i in range(n_rows):
        price, qty = price_col[i], quantity_col[i]
        # if the current qty would push the block past 50 units,
        # divide the units
        if qty_cumdif < qty:
            pq += qty_cumdif * price
            # at this point, 50 units have accumulated. calculate the average.
            vwap50.append(pq / 50)
            qty -= qty_cumdif
            # continue dividing off full blocks of 50
            while qty >= 50:
                qty -= 50
                vwap50.append(price)
            # remaining qty and pq become the starting
            # values for the next block of 50
            qty_cumdif = 50 - qty
            pq = qty * price
        # otherwise process the price, qty pair as-is
        else:
            qty_cumdif -= qty
            pq += qty * price
    return np.array(vwap50)
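(As a side note on the decorator: with recent Numba versions, @numba.njit or @numba.jit(nopython=True) explicitly requests nopython mode, which is the compilation mode that gives this kind of speedup.)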
Results of processing the OP's sample data.
Out[6]:
price quantity
0 100 20
1 102 31
2 105 25
3 99 40
4 104 10
5 103 20
6 101 55
vwap50_jit(df.price.values, df.quantity.values)
Out[7]: array([101.2 , 102.06, 101.76, 101. ])
Notice that I use .values to pass the numpy arrays of the pandas Series. That's one of the requirements of using numba: numba is numpy-aware and doesn't work on pandas objects.
It performs pretty well on 3 million row arrays, creating an output array of 2.25 million weighted averages.
df = pd.DataFrame({'price': np.random.randint(95, 150, 3000000),
'quantity': np.random.randint(1, 75, 3000000)})
%timeit vwap50_jit(df.price.values, df.quantity.values)
154 ms ± 4.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
vwap = vwap50_jit(df.price.values, df.quantity.values)
vwap.shape
Out[11]: (2250037,)
I have a spreadsheet that is updated with different purchase orders in it.
I basically need to show the total for each individual purchase order.
For example, I have orders 10, 11, 12, etc. with different amounts entered.
The only problem is that the entries arrive in a random order.
I need a formula that will total the different purchase orders even though they come in a random order.
They are all in the same column, however, as is the price.
Thanks in advance.
Leigh
You can use SUMIF
A B C D
1 OrderNo Price OrderNo OrderTotal
2 10 25 10 =SUMIF($A$2:$A$10,"=" & C2, $B$2:$B$10) // =175
3 12 100 11 =SUMIF($A$2:$A$10,"=" & C3, $B$2:$B$10) // =100
4 10 50 12 =SUMIF($A$2:$A$10,"=" & C4, $B$2:$B$10) // =200
5 11 10
6 10 50
7 12 100
8 11 75
9 11 15
10 10 50
You should consider a pivot table. Put the Order No in the Row field and Sum of Price in the data field.
Hm. Are we talking about the parts of a purchase order being separated from one another in the spreadsheet? I'm not sure I understand the question...
If you have an order ID or other unique field, you might be able to use VLOOKUP (but only if there's just one row for each order). You might also be able to sort by order number and use SUM.