Pandas find max column, subtract from another column and replace the value - python-3.x

I have a df like this:
A | B | C | D
14 | 5 | 10 | 5
4 | 7 | 15 | 6
100 | 220 | 6 | 7
For each row, I want to find the max value across columns A, B and C, subtract column D from it, and replace that max value with the result.
Expected result:
A | B | C | D
9 | 5 | 10 | 5
4 | 7 | 9 | 6
100 | 213 | 6 | 7
So for the first row, it would select 14 (the max of 14, 5, 10), subtract column D from it (14 - 5 = 9) and replace the initial value 14 with 9.
I know how to find the max value of A, B, C and subtract D from it, but I am stuck on the replacing part.
I thought of putting the result in another column called E, then finding the max of A, B, C again and replacing it with column E, but that makes no sense since I would be attempting to assign a value to a function call. Is there any other option to do this?
import pandas as pd
# Example df
list_columns = ['A', 'B', 'C', 'D']
list_data = [[14, 5, 10, 5], [4, 7, 15, 6], [100, 220, 6, 7]]
df = pd.DataFrame(columns=list_columns, data=list_data)
# Calculate the max of A, B, C and subtract D
df['e'] = df[['A', 'B', 'C']].max(axis=1) - df['D']
# To replace, maybe something like this. But this line makes no sense since it's backwards
# (left commented out because assigning to a function call is a syntax error)
# df[['A', 'B', 'C']].max(axis=1) = df['D']

Use DataFrame.mask to replace only the maximal values: compare every value in the selected columns with the row-wise maximum, and where they match, substitute the maximum minus column D:
cols = ['A', 'B', 'C']
s = df[cols].max(axis=1)
df[cols] = df[cols].mask(df[cols].eq(s, axis=0), s - df['D'], axis=0)
print (df)
     A    B   C  D
0    9    5  10  5
1    4    7   9  6
2  100  213   6  7
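One thing to be aware of with the mask approach above: if a row's maximum appears in more than one of the columns, every tied cell is replaced. A minimal sketch, assuming only the first maximum per row should change, is to locate it with NumPy's argmax. The sample frame is rebuilt from the question, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame(
    {"A": [14, 4, 100], "B": [5, 7, 220], "C": [10, 15, 6], "D": [5, 6, 7]}
)
cols = ["A", "B", "C"]

# Work on a copy of the selected columns as a plain 2-D array
values = df[cols].to_numpy(copy=True)

# Column position of the first maximum in each row
rows = np.arange(len(df))
first_max = values.argmax(axis=1)

# Subtract D only at those positions, then write the block back
values[rows, first_max] -= df["D"].to_numpy()
df[cols] = values

print(df)
```

For the sample data there are no ties, so this gives the same frame as the mask version.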

Related

how to change values in a df specifying by index contain in multiple lists, and each list for one column

I have a list containing all the indexes of the values to be replaced. I have to change them in 8 different columns, with 8 different lists. The replacement can be a simple string.
How can I do it?
I have more than 20 different columns in this df.
Eg:
list1 = [0,1,2]
list2 =[2,4]
list8 = ...
sustitution = 'no data'
| Column A | Column B |
| -------- | -------- |
| marcos | peter |
| Julila | mike |
| Fran | Ramon |
| Pedri | Gavi |
| Olmo | Torres |
OUTPUT:
| Column A | Column B |
| -------- | -------- |
| no data | peter |
| no data | mike |
| no data | no data |
| Pedri | Gavi |
| Olmo | no data |
Use DataFrame.loc with zipped lists and columns names:
list1 = [0,1,2]
list2 =[2,4]
L = [list1,list2]
cols = ['Column A','Column B']
sustitution = 'no data'
for c, i in zip(cols, L):
    df.loc[i, c] = sustitution
print (df)
Column A Column B
0 no data peter
1 no data mike
2 no data no data
3 Pedri Gavi
4 Olmo no data
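You can also write through the underlying NumPy array. Note that this relies on df.values returning a view of the data, which is only the case when all columns share one dtype; with Copy-on-Write enabled in newer pandas versions the array may be read-only, in which case the assignment fails:
import numpy as np
list1 = [0,1,2]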
list2 = [2,4]
lists = [list1, list2]
col = np.repeat(np.arange(len(lists)), list(map(len, lists)))
# array([0, 0, 0, 1, 1])
row = np.concatenate(lists)
# array([0, 1, 2, 2, 4])
df.values[row, col] = 'no data'
Output:
Column A Column B
0 no data peter
1 no data mike
2 no data no data
3 Pedri Gavi
4 Olmo no data
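If writing into df.values is not an option, a possible alternative is to build a boolean mask with the same shape as the frame and pass it to DataFrame.mask. This is only a sketch: the sample frame is rebuilt from the question, the dictionary `targets` is a hypothetical way of pairing each column with its list, and the lists are assumed to hold index labels.

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame(
    {
        "Column A": ["marcos", "Julila", "Fran", "Pedri", "Olmo"],
        "Column B": ["peter", "mike", "Ramon", "Gavi", "Torres"],
    }
)

# Hypothetical mapping: column name -> index labels to overwrite
targets = {"Column A": [0, 1, 2], "Column B": [2, 4]}

# Start with an all-False mask and switch on the targeted cells
mask = pd.DataFrame(False, index=df.index, columns=df.columns)
for col, idx in targets.items():
    mask.loc[idx, col] = True

# Replace every cell where the mask is True
result = df.mask(mask, "no data")
print(result)
```

Unlike the in-place versions, this leaves df untouched and returns a new frame.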

I have pandas dataframe with 3 columns and want output like this

DataFrame of 3 Column
a b c
1 2 4
1 2 4
1 2 4
Want Output like this
a b c a+b a+c b+c a+b+c
1 2 4 3 5 6 7
1 2 4 3 5 6 7
1 2 4 3 5 6 7
Create all combinations with length 2 or more by columns and then assign sum:
from itertools import chain, combinations
#https://stackoverflow.com/a/5898031
comb = chain(*map(lambda x: combinations(df.columns, x), range(2, len(df.columns)+1)))
for c in comb:
    df[f'{"+".join(c)}'] = df.loc[:, c].sum(axis=1)
print (df)
a b c a+b a+c b+c a+b+c
0 1 2 4 3 5 6 7
1 1 2 4 3 5 6 7
2 1 2 4 3 5 6 7
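For reference, the combination approach can be written as a self-contained sketch. The sample frame is rebuilt from the question; the column list is captured before the loop (so newly added columns are not combined again) and each combination is turned into a list before selecting.

```python
from itertools import combinations

import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({"a": [1, 1, 1], "b": [2, 2, 2], "c": [4, 4, 4]})

# Freeze the original column names before adding new ones
base_cols = list(df.columns)

# Every combination of 2 or more original columns, smallest first
for size in range(2, len(base_cols) + 1):
    for combo in combinations(base_cols, size):
        df["+".join(combo)] = df[list(combo)].sum(axis=1)

print(df)
```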
You should always post your approach when asking a question. However, here it goes. This is the easiest, but probably not the most elegant, way to solve it. For a more elegant approach, you should follow jezrael's answer.
Make your pandas dataframe here:
import pandas as pd
df = pd.DataFrame({"a": [1, 1, 1], "b": [2, 2, 2], "c": [4, 4, 4]})
Now make your desired dataframe like this:
df["a+b"] = df["a"] + df["b"]
df["a+c"] = df["a"] + df["c"]
df["b+c"] = df["b"] + df["c"]
df["a+b+c"] = df["a"] + df["b"] + df["c"]
This gives you:
| | a | b | c | a+b | a+c | b+c | a+b+c |
|---:|----:|----:|----:|------:|------:|------:|------:|
| 0 | 1 | 2 | 4 | 3 | 5 | 6 | 7 |
| 1 | 1 | 2 | 4 | 3 | 5 | 6 | 7 |
| 2 | 1 | 2 | 4 | 3 | 5 | 6 | 7 |

Only drop duplicates if number of duplicates is less than X

I need to drop duplicate rows in my DataFrame only if the number of duplicates is less than x (e.g. 3)
(if more than 3 duplicates, keep them !)
Sample:
where count is number of duplicates and duplicates are in col data
data | count
-------------
a | 1
b | 2
b | 2
c | 1
d | 3
d | 3
d | 3
Desired result:
data | count
-------------
a | 1
b | 1
c | 1
d | 3
d | 3
d | 3
How can i achieve this? Thanks in advance.
I believe you need to chain two conditions in boolean indexing: keep rows that are not duplicates according to DataFrame.duplicated, or whose count is greater than or equal to N. Finally, set count to 1 for the remaining groups smaller than N:
N = 3
df1 = df[~df.duplicated('data') | df['count'].ge(N)].copy()
df1.loc[df1['count'] < N, 'count'] = 1
print (df1)
data count
0 a 1
1 b 1
3 c 1
4 d 3
5 d 3
6 d 3
IIUC, you could do the following:
# create mask for non-duplicates and groups larger than 3
mask = (df.groupby('data')['count'].transform('count') >= 3) | ~df.duplicated('data')
# filter
filtered = df.loc[mask].drop('count', axis=1)
# reset count column
filtered['count'] = filtered.groupby('data')['data'].transform('count')
print(filtered)
Output
data count
0 a 1
1 b 1
3 c 1
4 d 3
5 d 3
6 d 3
N = 3
df['count'] = df['count'].apply(lambda x: 1 if x < N else x)
result = pd.concat([df[df['count'].eq(1)].drop_duplicates(), df[df['count'].eq(N)]])
result
data count
0 a 1
1 b 1
3 c 1
4 d 3
5 d 3
6 d 3
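The answers above assume a count column already exists. As a rough sketch for the case where it does not, the group sizes can be derived from the data column itself. The sample frame is rebuilt from the question and the threshold N = 3 follows the example.

```python
import pandas as pd

# Sample data rebuilt from the question, without a precomputed count
df = pd.DataFrame({"data": list("abbcddd")})
N = 3

# Size of each group, broadcast back to every row
size = df.groupby("data")["data"].transform("count")

# Keep whole groups of at least N rows, otherwise only the first occurrence
keep = (size >= N) | ~df.duplicated("data")
result = df[keep].copy()

# Recompute the count on the filtered rows
result["count"] = result.groupby("data")["data"].transform("count")
print(result)
```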

Looping to create a new column based on other column values in Python Dataframe [duplicate]

I want to create a new column in python dataframe based on other column values in multiple rows.
For example, my python dataframe df:
A | B
------------
10 | 1
20 | 1
30 | 1
10 | 1
10 | 2
15 | 3
10 | 3
I want to create a variable C based on the values of variable A, with a condition on variable B across multiple rows. When variable B has the same value in rows i, i+1, ..., the value of C is the sum of variable A over those rows. In this case, my output data frame will be:
A | B | C
--------------------
10 | 1 | 70
20 | 1 | 70
30 | 1 | 70
10 | 1 | 70
10 | 2 | 10
15 | 3 | 25
10 | 3 | 25
I have no idea what the best way to achieve this is. Can anyone help?
Thanks in advance
recreate the data:
import pandas as pd
A = [10,20,30,10,10,15,10]
B = [1,1,1,1,2,3,3]
df = pd.DataFrame({'A':A, 'B':B})
df
A B
0 10 1
1 20 1
2 30 1
3 10 1
4 10 2
5 15 3
6 10 3
and then i'll create a lookup Series from the df:
lookup = df.groupby('B')['A'].sum()
lookup
A
B
1 70
2 10
3 25
and then i'll use that lookup on the df using apply
df.loc[:,'C'] = df.apply(lambda row: lookup[lookup.index == row['B']].values[0], axis=1)
df
A B C
0 10 1 70
1 20 1 70
2 30 1 70
3 10 1 70
4 10 2 10
5 15 3 25
6 10 3 25
You can use the groupby() method to group the rows on B, then transform with 'sum' on A so that each group's sum is broadcast back to its rows.
df['C'] = df.groupby('B')['A'].transform('sum')
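The lookup idea from the earlier answer can also be expressed without a row-wise apply. A minimal sketch, using the sample data from the question: compute the per-group sums once, then map them onto column B with Series.map.

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame(
    {"A": [10, 20, 30, 10, 10, 15, 10], "B": [1, 1, 1, 1, 2, 3, 3]}
)

# One sum of A per distinct value of B (a Series indexed by B)
lookup = df.groupby("B")["A"].sum()

# Look up each row's B value in that Series
df["C"] = df["B"].map(lookup)
print(df)
```

This should give the same column C as the transform version; which one reads better is largely a matter of taste.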

Excel, Libreoffice/Openoffice Calc: count 'right' answers

I have a table with students' answers to 20 math problems like this:
A | B | C | D | E |...
------------+-----+-----+-----+-----+...
problem no | 1 | 2 | 3 | 4 |...
------------+-----+-----+-----+-----+...
right answer| 3 | 2 | A | 15 |...
------------+-----+-----+-----+-----+...
student1 | 3 | 4 | A | 12 |...
student2 | 2 | 2 | C | 15 |...
student3 | 3 | 2 | A | 13 |...
Now I need a column that counts the 'right' answers for each student.
I can do it like this: =(IF(D$3=D5;1;0))+(IF(E$3=E5;1;0))+(IF(F$3=F5;1;0))+...
...but it's not the nicest way :)
This is a typical use case for SUMPRODUCT:
   A             B  C  D  E   F  G
1  problem no    1  2  3  4
2  right answer  3  2  A  15     right answers per student
3  student1      3  4  A  12     2
4  student2      2  2  C  15     2
5  student3      3  2  A  13     3
Formula in G3:
=SUMPRODUCT($B$2:$E$2=$B3:$E3)
If there are more problem numbers, then the column letters in $E$2 and $E3 have to be increased.
How it works:
SUMPRODUCT evaluates its arguments as array formulas. So $B$2:$E$2=$B3:$E3 becomes an array such as {TRUE, FALSE, TRUE, FALSE}, depending on whether $B$2=$B3, $C$2=$C3, $D$2=$D3 and $E$2=$E3.
In LibreOffice or OpenOffice, TRUE counts as 1 and FALSE as 0, so SUMPRODUCT sums all the TRUEs.
In Excel you have to coerce the boolean values to numbers first, so the formula in Excel is =SUMPRODUCT(($B$2:$E$2=$B3:$E3)*1).
The formula in row 3 can then be filled down for all student rows. The $ before the row number 2 ensures that the reference to the row of right answers does not change when filling down.
Greetings
Axel
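For readers who think in code rather than spreadsheet formulas, the compare-then-sum logic behind SUMPRODUCT can be sketched in Python. This is only an illustration of the idea, not a spreadsheet replacement: the answers are stored as strings and the names `key`, `students` and `scores` are made up for the example.

```python
# Answer key and student answers from the table, stored as strings
key = ["3", "2", "A", "15"]
students = {
    "student1": ["3", "4", "A", "12"],
    "student2": ["2", "2", "C", "15"],
    "student3": ["3", "2", "A", "13"],
}

# Compare each answer with the key element by element; True counts as 1
scores = {
    name: sum(given == right for given, right in zip(answers, key))
    for name, answers in students.items()
}
print(scores)
```

The element-wise comparison plays the role of $B$2:$E$2=$B3:$E3, and sum() plays the role of SUMPRODUCT adding up the TRUE values.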
