how to get percentage of columns to sum of row in python [duplicate] - python-3.x

This question already has an answer here:
Normalize rows of pandas data frame by their sums [duplicate]
I have very high-dimensional data with more than 100 columns. As an example, I am sharing a simplified version of it below:
date product price amount
11/17/2019 A 10 20
11/24/2019 A 10 20
12/22/2020 A 20 30
15/12/2019 C 40 50
02/12/2020 C 40 50
I am trying to calculate each column's percentage of the total row sum, as illustrated below:
date product price amount
11/17/2019 A 10/(10+20) 20/(10+20)
11/24/2019 A 10/(10+20) 20/(10+20)
12/22/2020 A 20/(20+30) 30/(20+30)
15/12/2019 C 40/(40+50) 50/(40+50)
02/12/2020 C 40/(40+50) 50/(40+50)
Is there any way to do this efficiently for high-dimensional data? Thank you.

In addition to the provided link (Normalize rows of pandas data frame by their sums), you need to locate the specific columns, as your first two columns are non-numeric:
cols = df.columns[2:]  # skip the two non-numeric columns
df[cols] = df[cols].div(df[cols].sum(axis=1), axis=0)  # divide each row by its row sum
Out[1]:
         date product     price    amount
0  11/17/2019       A  0.333333  0.666667
1  11/24/2019       A  0.333333  0.666667
2  12/22/2020       A  0.400000  0.600000
3  15/12/2019       C  0.444444  0.555556
4  02/12/2020       C  0.444444  0.555556
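If the non-numeric columns are not always the first two, a more general sketch (my own variation, not from the linked answer) is to select the numeric columns by dtype rather than by position:

import pandas as pd

df = pd.DataFrame({'date': ['11/17/2019', '11/24/2019', '12/22/2020', '15/12/2019', '02/12/2020'],
                   'product': ['A', 'A', 'A', 'C', 'C'],
                   'price': [10, 10, 20, 40, 40],
                   'amount': [20, 20, 30, 50, 50]})

# normalize every numeric column by the row total, wherever those columns sit
cols = df.select_dtypes(include='number').columns
df[cols] = df[cols].div(df[cols].sum(axis=1), axis=0)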

Related

find average variance across columns & average of variance across rows in python

I have the data as follows:
a b c d
5 6 32 12
9 8 16 23
15 8 14 20
I want to compute the variance of each column, so that I have one variance per column. Then I would like to take the average of those variances across columns and finally reach one number for the entire dataset.
Can anyone please help?
Do you want var and mean?
df.var().mean()
output: 39.083333333333336
intermediate:
df.var()
a 25.333333
b 1.333333
c 97.333333
d 32.333333
dtype: float64
NB. By default, var uses 1 degree of freedom (ddof=1) to compute the variance. If you want 0, use df.var(ddof=0).
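For completeness, here is a self-contained sketch that reproduces the figures above from the sample data:

import pandas as pd

df = pd.DataFrame({'a': [5, 9, 15], 'b': [6, 8, 8],
                   'c': [32, 16, 14], 'd': [12, 23, 20]})

per_column = df.var()        # one variance per column (ddof=1 by default)
overall = per_column.mean()  # a single number for the whole dataset
print(overall)               # 39.083333333333336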

Sorting data frames with time in hours days months format

I have a data frame with times in one column, but not in sorted order. I want to sort it in ascending order. Can someone suggest a direct function or code to sort data frames by time?
My input data frame:
Time        data1
1 month     43.391588
13 h        31.548372
14 months   41.956652
3.5 h       31.847388
Expected data frame:
Time        data1
3.5 h       31.847388
13 h        31.548372
1 month     43.391588
14 months   41.956652
You need to replace the units with numbers first using Series.replace, then convert to numeric with pandas.eval, and finally use this column for sorting with DataFrame.sort_values:
d = {' months': '*30*24', ' month': '*30*24', ' h': '*1'}
df['sort'] = df['Time'].replace(d, regex=True).map(pd.eval)
df = df.sort_values('sort')
print(df)
Time data1 sort
3 3.5 h 31.847388 3.5
1 13 h 31.548372 13.0
0 1 month 43.391588 720.0
2 14 months 41.956652 10080.0
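If you prefer to avoid pd.eval, an alternative sketch (my own, not part of the answer above) splits each value into a number and a unit with Series.str.extract and maps the units to hours directly; the 30-day month is the same assumption the answer makes:

# split '3.5 h' / '14 months' into a numeric part and a unit part
parts = df['Time'].str.extract(r'(?P<value>[\d.]+)\s*(?P<unit>\w+)')
hours = {'h': 1, 'month': 30 * 24, 'months': 30 * 24}  # assumed conversion
key = parts['value'].astype(float) * parts['unit'].map(hours)
df = df.assign(sort=key).sort_values('sort').drop(columns='sort')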
First, check what type of data you have in your dataframe, as this will indicate how to proceed: df.dtypes (or, in your case, df.index.dtype).
The preferred option for sorting dataframes is df.sort_values().

How to write function to extract n+1 column in pandas

I have an Excel file with 200 columns. The first column is the number of visits, and the other columns hold the number of people for that number of visits:
Visits A B C D
2 10 0 30 40
3 5 6 0 1
4 2 3 1 0
I want to write a function so that I get multiple dataframes, each with the Visits column and one data column (Visits and A, Visits and B, and so on). I want a function because the number of columns will increase in the future and I want to automate the process. Also, I want to remove the rows with 0.
Desired output:
dataframe 1:
Visits A
2 10
3 5
4 2
dataframe 2:
Visits B
3 6
4 3
This is my first question. So sorry, if it is not properly framed. Thank you for your help.
Use DataFrame.items:
for i, col in df.set_index('Visits').items():
    print(col[col.ne(0)].to_frame(i).reset_index())
You can create a dict to store each result by column name:
dfs = {i: col[col.ne(0)].to_frame(i).reset_index()
       for i, col in df.set_index('Visits').items()}
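Since the question asks for a function, here is a minimal sketch wrapping the same idea; the name split_columns and the index_col parameter are my own additions, not from the answer:

import pandas as pd

def split_columns(df, index_col='Visits'):
    """Return one DataFrame per data column, with zero rows dropped."""
    return {name: col[col.ne(0)].to_frame(name).reset_index()
            for name, col in df.set_index(index_col).items()}

df = pd.DataFrame({'Visits': [2, 3, 4],
                   'A': [10, 5, 2], 'B': [0, 6, 3],
                   'C': [30, 0, 1], 'D': [40, 1, 0]})
dfs = split_columns(df)
print(dfs['B'])  # Visits 3 and 4 only; the zero row is removed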

Take the mean of n numbers in a DataFrame column and "drag" formula down similar to Excel

I'm trying to take the mean of n numbers in a pandas DataFrame column and "drag" the formula down each row to get the respective mean.
Let's say there are 6 rows of data with "Numbers" in column A and "Averages" in column B. I want to take the average of A1:A2, then "drag" that formula down to get the average of A2:A3, A3:A4, etc.
list = [55,6,77,75,9,127,13]
finallist = pd.DataFrame(list)
finallist.columns = ['Numbers']
Below gives me the average of rows 0:2 in the Numbers column. So calling out the rows with .iloc[0:2] works, but when I try to shift down a row it doesn't work:
finallist['Average'] = statistics.mean(finallist['Numbers'].iloc[0:2])
Below I'm trying to take the average of the first two rows, then shift down by 1 as you move down the rows, but I get a value of NaN:
finallist['Average'] = statistics.mean(finallist['Numbers'].iloc[0:2].shift(1))
I expected .iloc[0:2].shift(1) to shift the mean function down 1 row but still apply to 2 total rows, but I got a value of NaN.
What's happening in your shift(1) approach is that you're actually shifting the index in your data "down" once, so this code:
df['Numbers'].iloc[0:2].shift(1)
Produces the output:
0 NaN
1 55.0
Then you take the average of these two, which evaluates to NaN, and then you assign that single value to every element of the Averages Series here:
df['Averages'] = statistics.mean(df['Numbers'].iloc[0:2].shift(1))
You can instead use rolling() combined with mean() to get a sliding average across the entire data frame like this:
import pandas as pd
values = [55,6,77,75,9,127,13]
df = pd.DataFrame(values)
df.columns = ['Numbers']
df['Averages'] = df['Numbers'].rolling(2, min_periods=1).mean()
This produces the following output:
Numbers Averages
0 55 55.0
1 6 30.5
2 77 41.5
3 75 76.0
4 9 42.0
5 127 68.0
6 13 70.0
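One design note: rolling(2) labels each two-row average at the second row of its window, and min_periods=1 lets the first row average to itself. If the Excel-style intent was instead to label the average of A1:A2 at the first row, shifting the result up gives that; the shift(-1) below is my variation, not part of the answer:

# label each two-row mean at the top of the pair instead of the bottom
df['Averages_top'] = df['Numbers'].rolling(2).mean().shift(-1)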

Resampling on non-time related buckets

Say I have a df looking like this:
price quantity
0 100 20
1 102 31
2 105 25
3 99 40
4 104 10
5 103 20
6 101 55
There are no time intervals here. I need to calculate a Volume Weighted Average Price for every 50 items in quantity. Every row (index) in the output would represent 50 units (as opposed to say 5-min intervals), the output column would be the volume weighted price.
Any neat way to do this using pandas, or numpy for that matter? I tried using a loop splitting every row into one-item prices and then grouping them like this:
import itertools

def grouper(n, iterable):
    it = iter(iterable)
    while True:
        chunk = tuple(itertools.islice(it, n))
        if not chunk:
            return
        yield chunk
But it takes forever and I run out of memory. The df is a few million rows.
EDIT:
The output I want to see based on the above is:
vwap
0 101.20
1 102.12
2 103.36
3 101.00
Each 50 items gets a new average price.
I struck out on my first at-bat facing this problem. Here's my next plate appearance. Hopefully I can put the ball in play and score a run.
First, let's address some of the comments related to the expected outcome of this effort. The OP posted what he thought the results should be using the small sample data he provided. However, @user7138814 and I both came up with the same outcome, which differed from the OP's. Let me explain how I believe the weighted average of exactly 50 units should be calculated using the OP's example. I'll use this worksheet as an illustration.
The first 2 columns (A and B) are the original values given by the OP. Given those values, the goal is to calculate a weighted average for each block of exactly 50 units. Unfortunately, the quantities are not evenly divisible by 50. Columns C and D represent how to create even blocks of 50 units by subdividing the original quantities as needed. The yellow shaded areas show how the original quantity was subdivided, and each green-bounded block sums to exactly 50 units. Once 50 units are accumulated, the weighted average can be calculated in column E. As you can see, the values in E match what @user7138814 posted in his comment, so I think we agree on the methodology.
After much trial and error, the final solution is a function that operates on the numpy arrays underlying the price and quantity series. The function is further optimized with the Numba decorator, which jit-compiles the Python code into machine-level code. On my laptop, it processed 3-million-row arrays in well under a second.
Here's the function.
import numba
import numpy as np

@numba.jit
def vwap50_jit(price_col, quantity_col):
    n_rows = len(price_col)
    assert len(price_col) == len(quantity_col)
    qty_cumdif = 50  # units still needed before the current block reaches 50
    pq = 0.0         # cumulative sum of price * quantity for the current block
    vwap50 = []      # list of weighted averages
    for i in range(n_rows):
        price, qty = price_col[i], quantity_col[i]
        # if the current qty would push the block past 50 units,
        # divide the units
        if qty_cumdif < qty:
            pq += qty_cumdif * price
            # at this point, 50 units accumulated. calculate average.
            vwap50.append(pq / 50)
            qty -= qty_cumdif
            # continue dividing: each remaining full 50 is all at this price
            while qty >= 50:
                qty -= 50
                vwap50.append(price)
            # remaining qty and pq become starting
            # values for the next group of 50
            qty_cumdif = 50 - qty
            pq = qty * price
        # process the price, qty pair as-is
        else:
            qty_cumdif -= qty
            pq += qty * price
    return np.array(vwap50)
Results of processing the OP's sample data.
Out[6]:
price quantity
0 100 20
1 102 31
2 105 25
3 99 40
4 104 10
5 103 20
6 101 55
vwap50_jit(df.price.values, df.quantity.values)
Out[7]: array([101.2 , 102.06, 101.76, 101. ])
Notice that I pass the numpy arrays of the pandas series via the .values attribute. That's one of the requirements of using numba: numba is numpy-aware and doesn't work on pandas objects.
It performs pretty well on 3 million row arrays, creating an output array of 2.25 million weighted averages.
df = pd.DataFrame({'price': np.random.randint(95, 150, 3000000),
                   'quantity': np.random.randint(1, 75, 3000000)})
%timeit vwap50_jit(df.price.values, df.quantity.values)
154 ms ± 4.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
vwap = vwap50_jit(df.price.values, df.quantity.values)
vwap.shape
Out[11]: (2250037,)
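As a quick cross-check (my addition, and only practical on small inputs: materializing one row per unit is exactly the memory blow-up the OP ran into), expanding the OP's 7-row sample df into unit prices and averaging fixed blocks of 50 reproduces the jit function's output:

# df here is the OP's 7-row sample, not the 3-million-row benchmark frame
units = np.repeat(df.price.values, df.quantity.values)
n_blocks = len(units) // 50  # the incomplete tail block is dropped
check = units[:n_blocks * 50].reshape(-1, 50).mean(axis=1)
print(check)  # [101.2  102.06 101.76 101.  ]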
