Efficient way to perform iterative subtraction and division operations on pandas columns - python-3.x

I have a following dataframe-
A B C Result
0 232 120 9 91
1 243 546 1 12
2 12 120 5 53
I want to perform the operation of the following kind-
A B C Result A-B/A+B A-C/A+C B-C/B+C
0 232 120 9 91 0.318182 0.925311 0.860465
1 243 546 1 12 -0.384030 0.991803 0.996344
2 12 120 5 53 -0.818182 0.411765 0.920000
which I am doing using
df['A-B/A+B']=(df['A']-df['B'])/(df['A']+df['B'])
df['A-C/A+C']=(df['A']-df['C'])/(df['A']+df['C'])
df['B-C/B+C']=(df['B']-df['C'])/(df['B']+df['C'])
which I believe is a very crude and ugly way to do.
How to do it in a more correct way?

You can do the following:
# take columns in a list except the last column
colnames = df.columns.tolist()[:-1]
# compute
for i, c in enumerate(colnames):
if i != len(colnames):
for k in range(i+1, len(colnames)):
df[c + '_' + colnames[k]] = (df[c] - df[colnames[k]]) / (df[c] + df[colnames[k]])
# check result
print(df)
A B C Result A_B A_C B_C
0 232 120 9 91 0.318182 0.925311 0.860465
1 243 546 1 12 -0.384030 0.991803 0.996344
2 12 120 5 53 -0.818182 0.411765 0.920000

This is a perfect case to use DataFrame.eval:
cols = ['A-B/A+B','A-C/A+C','B-C/B+C']
x = pd.DataFrame([df.eval(col).values for col in cols], columns=cols)
df.assign(**x)
A B C Result A-B/A+B A-C/A+C B-C/B+C
0 232 120 9 91 351.482759 786.753086 122.000000
1 243 546 1 12 240.961207 243.995885 16.583333
2 12 120 5 53 128.925000 546.998168 124.958333
The advantage of this method respect to the other solution, is that it does not depend on the order of the operation sings that appear as column names, but rather as mentioned in the documentation it is used to:
Evaluate a string describing operations on DataFrame columns.

Related

loops application in dataframe to find output

I have the following data:
dict={'A':[1,2,3,4,5],'B':[10,20,233,29,2],'C':[10,20,3040,230,238]...................}
and
df= pd.Dataframe(dict)
In this manner I have 20 columns with 5 numerical entry in each column
I want to have a new column where the value should come as the following logic:
0 A[0]*B[0]+A[0]*C[0] + A[0]*D[0].......
1 A[1]*B[1]+A[1]*C[1] + A[1]*D[1].......
2 A[2]*B[2]+A[2]*B[2] + A[2]*D[2].......
I tried in the following manner but manually I can not put 20 columns, so I wanted to know the way to apply a loop to get the desired output
:
lst=[]
for i in range(0,5):
j=df.A[i]*df.B[i]+ df.A[i]*df.C[i]+.......
lst.append(j)
i=i+1
A potential solution is the following. I am only taking the example you posted but is works fine for more. Your data is df
A B C
0 1 10 10
1 2 20 20
2 3 233 3040
3 4 29 230
4 5 2 238
You can create a new column, D by first subsetting your dataframe
add = df.loc[:, df.columns != 'A']
and then take the sum over all multiplications of the columns in D with column A in the following way:
df['D'] = df['A']*add.sum(axis=1)
which returns
A B C D
0 1 10 10 20
1 2 20 20 80
2 3 233 3040 9819
3 4 29 230 1036
4 5 2 238 1200

How to do cumulative mean and count in a easy way

I have following dataframe in pandas
data = {'call_put':['C', 'C', 'P','C', 'P'],'price':[10,20,30,40,50], 'qty':[11,12,11,14,9]}
df['amt']=df.price*df.qty
df=pd.DataFrame(data)
call_put price qty amt
0 C 10 11 110
1 C 20 12 240
2 P 30 11 330
3 C 40 14 560
4 P 50 9 450
I want output something like following based on call_put value is 'C' or 'P' count, median and calculation as follows
call_put price qty amt cummcount cummmedian cummsum
C 10 11 110 1 110 110
C 20 12 240 2 175 ((110+240)/2 ) 350
P 30 11 330 1 330 680
C 40 14 560 3 303.33 (110+240+560)/3 1240
P 50 9 450 2 390 ((330+450)/2) 1690
Can it be done in some easy way without creating additional dataframes and functions?
create a grouped element named g and use df.assign to assign values:
g=df.groupby('call_put')
final=df.assign(cum_count=g.cumcount().add(1),
cummedian=g['amt'].expanding().mean().reset_index(drop=True), cum_sum=df.amt.cumsum())
call_put price qty amt cum_count cummedian cum_sum
0 C 10 11 110 1 110.000000 110
1 C 20 12 240 2 175.000000 350
2 P 30 11 330 1 303.333333 680
3 C 40 14 560 3 330.000000 1240
4 P 50 9 450 2 390.000000 1690
Note: for P , the cummedian should be 390 since (330+450)/2 = 390
For cum_count look at df.groupby.cumcount()
for cummedian check how expanding() works ,
for cumsum check df.cumsum()
IIUC, this should work
df['cumcount']=df.groupby('call_put').cumcount()
df['cummidean']=df.groupby('call_put')['amt'].cumsum()
df['cumsum']=df.groupby('call_put').cumsum()
Thanks following solution is fine
g=df.groupby('call_put')
final=df.assign(cum_count=g.cumcount().add(1),
cummedian=g['amt'].expanding().mean().reset_index(drop=True), cum_sum=df.amt.cumsum())
if I run following without drop=True
g['amt'].expanding().mean().reset_index()
why output is showing level_1
call_put level_1 amt
0 C 0 110.000000
1 C 1 175.000000
2 C 3 303.333333
3 P 2 330.000000
4 P 4 390.000000
g['amt'].expanding().mean().reset_index(drop=True)
0 110.000000
1 175.000000
2 303.333333
3 330.000000
4 390.000000
Name: amt, dtype: float64
Can you pl explain in more detail ?
How do you add one more condition in groupby clause
g=df.groupby('call_put', 'price' < 50)
TypeError: '<' not supported between instances of 'str' and 'int'

pandas df merge avoid duplicate column names

The question is when merge two dfs, and they all have a column called A, then the result will be a df having A_x and A_y, I am wondering how to keep A from one df and discard another one, so that I don't have to rename A_x to A later on after the merge.
Just filter your dataframe columns before merging.
df1 = pd.DataFrame({'Key':np.arange(12),'A':np.random.randint(0,100,12),'C':list('ABCD')*3})
df2 = pd.DataFrame({'Key':np.arange(12),'A':np.random.randint(100,1000,12),'C':list('ABCD')*3})
df1.merge(df2[['Key','A']], on='Key')
Output: (Note: C is not duplicated)
A_x C Key A_y
0 60 A 0 440
1 65 B 1 731
2 76 C 2 596
3 67 D 3 580
4 44 A 4 477
5 51 B 5 524
6 7 C 6 572
7 88 D 7 984
8 70 A 8 862
9 13 B 9 158
10 28 C 10 593
11 63 D 11 177
It depends if need append columns with duplicated columns names to final merged DataFrame:
...then add suffixes parameter to merge:
print (df1.merge(df2, on='Key', suffixes=('', '_')))
--
... if not use #Scott Boston solution.

Pandas multi-index subtract from value based on value in other column part 2

Based on a thorough and accurate response to this question, I am now faced with a new issue based on slightly different data.
Given this data frame:
df = pd.DataFrame({
('A', 'a'): [23,3,54,7,32,76],
('B', 'b'): [23,'n/a',54,7,32,76],
('possible','possible'):[100,100,100,100,100,100]
})
df
A B possible
a b possible
0 23 23 100
1 3 n/a 100
2 54 54 100
3 7 n/a 100
4 32 32 100
5 76 76 100
I'd like to subtract 4 from 'possible', per row, for any instance (column) where the value is 'n/a' for that row (and then change all 'n/a' values to 0).
A B possible
a b possible
0 23 23 100
1 3 n/a 96
2 54 54 100
3 7 n/a 96
4 32 32 100
5 76 76 100
Some conditions:
It may occur that a column is all floats (though they appear to be integers upon inspection). This was not factored into the original question.
It may also occur that a row contains two instances (columns) of 'n/a' values. This was addressed by the previous solution.
Here is the previous solution:
idx = pd.IndexSlice
df.loc[:, idx['possible', 'possible']] -= (df.loc[:, idx[('A','B'),:]] == 'n/a').sum(axis=1) * 4
df.replace({'n/a':0}, inplace=True)
It works, except for where a column (A or B) contains all floats (seemingly integers). When that's the case, this error occurs:
TypeError: Could not compare ['n/a'] with block values
I think you can add casting to string by astype to condition:
idx = pd.IndexSlice
df.loc[:, idx['possible', 'possible']] -=
(df.loc[:, idx[('A','B'),:]].astype(str) == 'n/a').sum(axis=1) * 4
df.replace({'n/a':0}, inplace=True)
print df
A B possible
a b possible
0 23 23 100
1 3 0 96
2 54 54 100
3 7 0 96
4 32 32 100
5 76 76 100

how to add a new column in dataframe which divides multiple columns and finds the maximum value

This maybe real simple solution but I am new to python 3 and I have a dataframe with multiple columns. I would like to add a new column to the existing dataframe - which does the following calculation i.e.
New Column = Max((Column A/Column B), (Column C/Column D), (Column E/Column F))
I can do a max based on the following code but wanted to check how can I do div alongwith it.
df['Max'] = df[['Column A','Column B','Column C', 'Column D', 'Column E', 'Column F']].max(axis=1)
Column A Column B Column C Column D Column E Column F Max
3600 36000 22 11 3200 3200 36000
2300 2300 13 26 1100 1200 2300
1300 13000 15 33 1000 1000 13000
Thanks
You can div the df by itself by slicing the columns in steps and then take the max:
In [105]:
df['Max'] = df.ix[:,df.columns[::2]].div(df.ix[:,df.columns[1::2]].values, axis=1).max(axis=1)
df
Out[105]:
Column A Column B Column C Column D Column E Column F Max
0 3600 36000 22 11 3200 3200 2
1 2300 2300 13 26 1100 1200 1
2 1300 13000 15 33 1000 1000 1
Here are the intermediate values:
In [108]:
df.ix[:,df.columns[::2]].div(df.ix[:,df.columns[1::2]].values, axis=1)
Out[108]:
Column A Column C Column E
0 0.1 2.000000 1.000000
1 1.0 0.500000 0.916667
2 0.1 0.454545 1.000000
You can try something like as follows
df['Max'] = df.apply(lambda v: max(v['A'] / v['B'].astype(float), v['C'] / V['D'].astype(float), v['E'] / v['F'].astype(float)), axis=1)
Example
In [14]: df
Out[14]:
A B C D E F
0 1 11 1 11 12 98
1 2 22 2 22 67 1
2 3 33 3 33 23 4
3 4 44 4 44 11 10
In [15]: df['Max'] = df.apply(lambda v: max(v['A'] / v['B'].astype(float), v['C'] /
v['D'].astype(float), v['E'] / v['F'].astype(float)), axis=1)
In [16]: df
Out[16]:
A B C D E F Max
0 1 11 1 11 12 98 0.122449
1 2 22 2 22 67 1 67.000000
2 3 33 3 33 23 4 5.750000
3 4 44 4 44 11 10 1.100000

Resources