Replace NaN with existing value of the group - python-3.x

Name Value
0 AA 33
1 AA 24
2 BB 23
3 BB NaN
4 CC NaN
5 CC 23
6 CC 45
How can I replace these NaN with existing values by looking at column Name? For CC I would like to get the max (but if it is too convoluted, then I am fine with either 23 or 45). The expected output:
Name Value
0 AA 33
1 AA 24
2 BB 23
3 BB 23
4 CC 45
5 CC 23
6 CC 45
Thanks!

You can groupby and transform with max then fillna:
df['Value'] = df['Value'].fillna(df.groupby("Name")['Value'].transform('max'))
print(df)
Name Value
0 AA 33.0
1 AA 24.0
2 BB 23.0
3 BB 23.0
4 CC 45.0
5 CC 23.0
6 CC 45.0
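For anyone who wants to reproduce this, here is a minimal self-contained sketch of the same fillna/transform idea. The frame is rebuilt from the question's sample data; the names group_max and filled are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    "Name": ["AA", "AA", "BB", "BB", "CC", "CC", "CC"],
    "Value": [33, 24, 23, np.nan, np.nan, 23, 45],
})

# transform('max') returns one value per original row (the max of its group),
# so the result lines up with df['Value'] and can be passed to fillna
group_max = df.groupby("Name")["Value"].transform("max")
filled = df["Value"].fillna(group_max)

print(df.assign(Value=filled))
```

Only the NaN rows are touched; existing values such as 24 in group AA stay as they are. The column ends up as float because it held NaN to begin with.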

You can also use a lambda with transform (select the Value column first, so a Series is assigned back):
df["Value"] = df.groupby('Name')['Value'].transform(lambda x: x.fillna(x.max()))
df
Name Value
0 AA 33.0
1 AA 24.0
2 BB 23.0
3 BB 23.0
4 CC 45.0
5 CC 23.0
6 CC 45.0

Related

Python: Groupby and sum respective rows and update dataframe column

Input df:
Store Category Item tot_table
11 AA Apple 13.5
11 AA Orange 13.5
11 BB Potato 11.5
11 BB Carrot 11.5
12 AA Apple 10
12 BB Potato 9
12 BB Carrot 9
I need to perform df.groupby('Store')['tot_table'].unique().sum(), but this line of code doesn't work.
Expected output df:
Store Category Item split_table tot_table
11 AA Apple 13.5 25
11 AA Orange 13.5 25
11 BB Potato 11.5 25
11 BB Carrot 11.5 25
12 AA Apple 10 19
12 BB Potato 9 19
12 BB Carrot 9 19
You can use groupby.transform with unique/sum:
df['tot_table'] = (df.groupby('Store')['tot_table']
.transform(lambda s: s.unique().sum())
)
output:
Store Category Item tot_table
0 11 AA Apple 25.0
1 11 AA Orange 25.0
2 11 BB Potato 25.0
3 11 BB Carrot 25.0
4 12 AA Apple 19.0
5 12 BB Potato 19.0
6 12 BB Carrot 19.0
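One caveat with unique(): if two categories in the same store happened to have the same value, they would be counted once. A possible variant, sketched below under the assumption that each (Store, Category) pair carries a single value, deduplicates on those two columns instead and also keeps the original numbers in split_table, as in the expected output. The names per_category and store_totals are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Store": [11, 11, 11, 11, 12, 12, 12],
    "Category": ["AA", "AA", "BB", "BB", "AA", "BB", "BB"],
    "Item": ["Apple", "Orange", "Potato", "Carrot", "Apple", "Potato", "Carrot"],
    "tot_table": [13.5, 13.5, 11.5, 11.5, 10, 9, 9],
})

# Keep the per-category value under the name used in the expected output
df = df.rename(columns={"tot_table": "split_table"})

# One row per (Store, Category), then sum per store
per_category = df.drop_duplicates(["Store", "Category"])
store_totals = per_category.groupby("Store")["split_table"].sum()

# Broadcast the per-store total back onto every row
df["tot_table"] = df["Store"].map(store_totals)
print(df)
```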

Groupby, compare one column value with another column's maximum value in Pandas

Given a dataframe df as follows:
id building floor_number floor_name
0 1 A 8 5F
1 2 A 4 4F
2 3 A 3 3F
3 4 A 2 2F
4 5 A 1 1F
5 6 B 14 17F
6 7 B 13 16F
7 8 B 20 world
8 9 B 13 hello
9 10 B 13 16F
I need to extract the numeric values from the floor_name column, group by building, and then compare each row's floor_number with the maximum extracted floor_name value for its building. If floor_number is bigger than that maximum, a new column check should contain invalid floor number.
This is expected result:
id building ... floor_name check
0 1 A ... 5F invalid floor number
1 2 A ... 4F NaN
2 3 A ... 3F NaN
3 4 A ... 2F NaN
4 5 A ... 1F NaN
5 6 B ... 17F NaN
6 7 B ... 16F NaN
7 8 B ... world invalid floor number
8 9 B ... hello NaN
9 10 B ... 16F NaN
To extract values from floor_name, group by building and get the max of floor_name, I have used:
df['floor_name'] = df['floor_name'].str.extract(r'(\d*)', expand = False)
df.groupby('building')['floor_name'].max()
Out:
building
A 5
B 17
Name: floor_name, dtype: object
How could I finish the rest of the code? Thanks in advance.
Use groupby().transform(). Also, it's better to convert to a numeric type, since as strings '2' > '17':
numeric_floors = (df['floor_name'].str.extract(r'(\d+)', # use \d+ instead of \d*
expand=False)
.astype(float) # convert to numeric type
.groupby(df['building'])
.transform('max')
)
df.loc[df['floor_number'] > numeric_floors, 'check'] = 'invalid floor number'
Output:
id building floor_number floor_name check
0 1 A 8 5F invalid floor number
1 2 A 4 4F NaN
2 3 A 3 3F NaN
3 4 A 2 2F NaN
4 5 A 1 1F NaN
5 6 B 14 17F NaN
6 7 B 13 16F NaN
7 8 B 20 world invalid floor number
8 9 B 13 hello NaN
9 10 B 13 16F NaN
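Here is a self-contained sketch of the same steps, with the frame rebuilt from the question. The intermediate names extracted, max_floor and mask are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "id": range(1, 11),
    "building": list("AAAAABBBBB"),
    "floor_number": [8, 4, 3, 2, 1, 14, 13, 20, 13, 13],
    "floor_name": ["5F", "4F", "3F", "2F", "1F",
                   "17F", "16F", "world", "hello", "16F"],
})

# Pull out the digits; rows without digits ('world', 'hello') become NaN.
# Casting to float matters: as strings, '2' > '17' is True.
extracted = df["floor_name"].str.extract(r"(\d+)", expand=False).astype(float)

# Per-building maximum, repeated on every row of that building (NaN is skipped)
max_floor = extracted.groupby(df["building"]).transform("max")

# Flag only the rows whose floor_number exceeds the building's maximum
mask = df["floor_number"] > max_floor
df.loc[mask, "check"] = "invalid floor number"
print(df)
```

Rows that are not flagged keep NaN in check, which matches the expected result in the question.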

Pandas: Combine pandas columns that have the same column name

If we have the following df,
df
A A B B B
0 10 2 0 3 3
1 20 4 19 21 36
2 30 20 24 24 12
3 40 10 39 23 46
How can I combine the content of the columns with the same names?
e.g.
A B
0 10 0
1 20 19
2 30 24
3 40 39
4 2 3
5 4 21
6 20 24
7 10 23
8 Na 3
9 Na 36
10 Na 12
11 Na 46
I tried groupby and merge, but neither does the job.
Any help is appreciated.
If column names are duplicated, you can use DataFrame.melt with concat:
df = pd.concat([df['A'].melt()['value'], df['B'].melt()['value']], axis=1, keys=['A','B'])
print (df)
A B
0 10.0 0
1 20.0 19
2 30.0 24
3 40.0 39
4 2.0 3
5 4.0 21
6 20.0 24
7 10.0 23
8 NaN 3
9 NaN 36
10 NaN 12
11 NaN 46
EDIT: a more general version that loops over the unique column names:
uniq = df.columns.unique()
df = pd.concat([df[c].melt()['value'] for c in uniq], axis=1, keys=uniq)
print (df)
A B
0 10.0 0
1 20.0 19
2 30.0 24
3 40.0 39
4 2.0 3
5 4.0 21
6 20.0 24
7 10.0 23
8 NaN 3
9 NaN 36
10 NaN 12
11 NaN 46
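As an alternative to melt, the same stacking can be sketched with NumPy's column-major flattening. This is a different technique from the answer above, not a restatement of it; stack_duplicates is a hypothetical helper name, and the frame is rebuilt from the question.

```python
import pandas as pd

df = pd.DataFrame(
    [[10, 2, 0, 3, 3],
     [20, 4, 19, 21, 36],
     [30, 20, 24, 24, 12],
     [40, 10, 39, 23, 46]],
    columns=list("AABBB"),
)

def stack_duplicates(frame):
    """Stack same-named columns on top of each other (hypothetical helper)."""
    out = {}
    for name in frame.columns.unique():
        # Boolean mask selects every column carrying this name
        block = frame.loc[:, frame.columns == name]
        # order='F' flattens column by column: first A, then second A, ...
        out[name] = pd.Series(block.to_numpy().ravel(order="F"))
    # Series of different lengths are aligned on the index and padded with NaN
    return pd.DataFrame(out)

result = stack_duplicates(df)
print(result)
```

Column A is shorter than B, so it is padded with NaN and becomes float, just as in the melt output.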

Manipulating Pandas Columns into different labels [duplicate]

Let's assume that I have the following dataframe in pandas:
AA BB CC
date
05/03 1 2 3
06/03 4 5 6
07/03 7 8 9
08/03 5 7 1
and I want to transform it to the following:
AA 05/03 1
AA 06/03 4
AA 07/03 7
AA 08/03 5
BB 05/03 2
BB 06/03 5
BB 07/03 8
BB 08/03 7
CC 05/03 3
CC 06/03 6
CC 07/03 9
CC 08/03 1
How can I do it?
The reason for the transformation from wide to long is that, in the next stage, I would like to merge this dataframe with another one, based on dates and the initial column names (AA, BB, CC).
Use pandas.melt or pandas.DataFrame.melt to transform from wide to long:
df = pd.DataFrame({
'date' : ['05/03', '06/03', '07/03', '08/03'],
'AA' : [1, 4, 7, 5],
'BB' : [2, 5, 8, 7],
'CC' : [3, 6, 9, 1]
}).set_index('date')
df
AA BB CC
date
05/03 1 2 3
06/03 4 5 6
07/03 7 8 9
08/03 5 7 1
To convert, we just need to reset the index and then melt:
df = df.reset_index()
pd.melt(df, id_vars='date', value_vars=['AA', 'BB', 'CC'])
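Alternatively, starting from the original df (with date still as the index), melt with ignore_index=False and call .reset_index afterwards; this removes the need to specify value_vars.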
dfm = df.melt(ignore_index=False).reset_index()
Final Result - both options
date variable value
0 05/03 AA 1
1 06/03 AA 4
2 07/03 AA 7
3 08/03 AA 5
4 05/03 BB 2
5 06/03 BB 5
6 07/03 BB 8
7 08/03 BB 7
8 05/03 CC 3
9 06/03 CC 6
10 07/03 CC 9
11 08/03 CC 1
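Since the stated goal is a later merge on date and the original column name, here is a small sketch of that follow-up step. The second frame other and its weight column are made up purely for illustration, as is the column name label; var_name and value_name are regular melt parameters.

```python
import pandas as pd

wide = pd.DataFrame({
    "date": ["05/03", "06/03", "07/03", "08/03"],
    "AA": [1, 4, 7, 5],
    "BB": [2, 5, 8, 7],
    "CC": [3, 6, 9, 1],
}).set_index("date")

# Wide to long, naming the new columns explicitly
long = wide.reset_index().melt(id_vars="date", var_name="label", value_name="value")

# Hypothetical second frame keyed by date and the original column name
other = pd.DataFrame({
    "date": ["05/03", "06/03"],
    "label": ["AA", "BB"],
    "weight": [0.5, 2.0],
})

# A left merge keeps all 12 long rows; unmatched rows get NaN in weight
merged = long.merge(other, on=["date", "label"], how="left")
print(merged)
```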
Update
As George Liu has shown in another answer, pd.melt is the idiomatic, flexible and fast solution to this problem. Do not use unstack for this.
unstack returns a series with a multiindex:
In [38]: df.unstack()
Out[38]:
date
AA 05/03 1
06/03 4
07/03 7
08/03 5
BB 05/03 2
06/03 5
07/03 8
08/03 7
CC 05/03 3
06/03 6
07/03 9
08/03 1
dtype: int64
You can call reset_index on the returning series:
In [39]: df.unstack().reset_index()
Out[39]:
level_0 date 0
0 AA 05/03 1
1 AA 06/03 4
2 AA 07/03 7
3 AA 08/03 5
4 BB 05/03 2
5 BB 06/03 5
6 BB 07/03 8
7 BB 08/03 7
8 CC 05/03 3
9 CC 06/03 6
10 CC 07/03 9
11 CC 08/03 1
Or construct a dataframe with a multiindex:
In [40]: pd.DataFrame(df.unstack())
Out[40]:
0
date
AA 05/03 1
06/03 4
07/03 7
08/03 5
BB 05/03 2
06/03 5
07/03 8
08/03 7
CC 05/03 3
06/03 6
07/03 9
08/03 1

Multiply 2 different dataframe with same dimension and repeating rows

I am trying to multiply two data frames:
Df1
Name|Key |100|101|102|103|104
Abb AB 2 6 10 5 1
Bcc BC 1 3 7 4 2
Abb AB 5 1 11 3 1
Bcc BC 7 1 4 5 0
Df2
Key_1|100|101|102|103|104
AB 10 2 1 5 1
BC 1 10 2 2 4
Expected Output
Name|Key |100|101|102|103|104
Abb AB 20 12 10 25 1
Bcc BC 1 30 14 8 8
Abb AB 50 2 11 15 1
Bcc BC 7 10 8 10 0
I have tried grouping Df1 and then multiplying with Df2, but it didn't work.
Please help me with how to approach this problem.
You can rename the df2 column Key_1 to Key (as in df1), then set the index and use mul with level=1:
df1.set_index(['Name','Key']).mul(df2.rename(columns={'Key_1':'Key'})
.set_index('Key'),level=1).reset_index()
Or similar:
df1.set_index(['Name','Key']).mul(df2.set_index('Key_1')
.rename_axis('Key'),level=1).reset_index()
As correctly pointed out by @QuangHoang, you can do it without renaming too:
df1.set_index(['Name','Key']).mul(df2.set_index('Key_1'),level=1).reset_index()
Name Key 100 101 102 103 104
0 Abb AB 20 12 10 25 1
1 Bcc BC 1 30 14 8 8
2 Abb AB 50 2 11 15 1
3 Bcc BC 7 10 8 10 0
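A more explicit way to see what is going on is to look up the matching df2 row for every df1 row and multiply position by position. The sketch below is a different technique from the level-based mul above, and it assumes the numeric column labels are strings ('100' to '104'); value_cols, factors and result are only illustrative names.

```python
import pandas as pd

value_cols = ["100", "101", "102", "103", "104"]  # assumed to be string labels

df1 = pd.DataFrame(
    [["Abb", "AB", 2, 6, 10, 5, 1],
     ["Bcc", "BC", 1, 3, 7, 4, 2],
     ["Abb", "AB", 5, 1, 11, 3, 1],
     ["Bcc", "BC", 7, 1, 4, 5, 0]],
    columns=["Name", "Key"] + value_cols,
)
df2 = pd.DataFrame(
    [["AB", 10, 2, 1, 5, 1],
     ["BC", 1, 10, 2, 2, 4]],
    columns=["Key_1"] + value_cols,
)

# One row of factors per df1 row, in df1's row order (keys may repeat)
factors = df2.set_index("Key_1").loc[df1["Key"], value_cols]

# Multiply the raw arrays so that index labels play no role
result = df1.copy()
result[value_cols] = df1[value_cols].to_numpy() * factors.to_numpy()
print(result)
```

Note that .loc raises a KeyError if df1 contains a key that is missing from df2, which may or may not be the behaviour you want.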
If I understand correctly, you can use reindex_like:
df1.set_index('Key',inplace=True)
df1=df1.mul(df2.set_index('Key_1').reindex_like(df1).values).fillna(df1)
Out[235]:
Name 100 101 102 103 104
Key
AB Abb 20.0 12.0 10.0 25.0 1.0
BC Bcc 1.0 30.0 14.0 8.0 8.0
AB Abb 50.0 2.0 11.0 15.0 1.0
BC Bcc 7.0 10.0 8.0 10.0 0.0
We could also use DataFrame.merge with pd.Index.difference to select columns. Merging from df1's Key column with how='left' keeps the factor rows in df1's row order, so the multiplication lines up row by row:
mul_cols = df1.columns.difference(['Name','Key'])
df1.assign(**df1[mul_cols].mul(df1[['Key']].merge(df2,
left_on = 'Key',
right_on = 'Key_1',
how = 'left')[mul_cols]))
Name Key 100 101 102 103 104
0 Abb AB 20 12 10 25 1
1 Bcc BC 1 30 14 8 8
2 Abb AB 50 2 11 15 1
3 Bcc BC 7 10 8 10 0
