Pandas rearranging columns with MultiIndex - python-3.x

I need to rearrange columns, which involves:
find the columns that are missing with respect to a list
create the new columns, sorted as in the list
keep the MultiIndex
import pandas as pd
cols = ['c1','c2','c3','c4']
df=pd.DataFrame(data=[[1,1,2,2,3,3],[4,4,5,5,6,6],[7,7,8,8,9,9]],index=['r1','r2','r3'],columns=pd.MultiIndex.from_tuples([('c2','mean'),('c2','max'),('c1','mean'),('c1','max'),('c3','mean'),('c3','max')]))
df
Out[52]:
c2 c1 c3
mean max mean max mean max
r1 1 1 2 2 3 3
r2 4 4 5 5 6 6
r3 7 7 8 8 9 9
so that the final result is:
df
Out[52]:
c1 c2 c3 c4
mean max mean max mean max mean max
r1 2 2 1 1 3 3 NaN NaN
r2 5 5 4 4 6 6 NaN NaN
r3 8 8 7 7 9 9 NaN NaN

Use DataFrame.reindex to change the column order and, in the same step, add the missing combinations to the original data. Pass it a new MultiIndex, e.g. one created by MultiIndex.from_product:
mux = pd.MultiIndex.from_product([['c1','c2','c3','c4'], ['mean','max']])
df = df.reindex(mux, axis=1)
print (df)
c1 c2 c3 c4
mean max mean max mean max mean max
r1 2 2 1 1 3 3 NaN NaN
r2 5 5 4 4 6 6 NaN NaN
r3 8 8 7 7 9 9 NaN NaN
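The target MultiIndex can also be derived from the cols list and the existing second level instead of being typed out. The following is a minimal sketch under that assumption; the names missing, stats and result are introduced here for illustration only.

```python
import pandas as pd

cols = ['c1', 'c2', 'c3', 'c4']
df = pd.DataFrame(
    data=[[1, 1, 2, 2, 3, 3], [4, 4, 5, 5, 6, 6], [7, 7, 8, 8, 9, 9]],
    index=['r1', 'r2', 'r3'],
    columns=pd.MultiIndex.from_tuples(
        [('c2', 'mean'), ('c2', 'max'), ('c1', 'mean'),
         ('c1', 'max'), ('c3', 'mean'), ('c3', 'max')]
    ),
)

# Top-level labels requested in cols but absent from df.
missing = [c for c in cols if c not in df.columns.get_level_values(0)]

# Second-level labels in order of first appearance.
stats = df.columns.get_level_values(1).unique()

# Every requested column paired with every statistic, in list order.
mux = pd.MultiIndex.from_product([cols, stats])

# Existing columns are reordered; missing combinations are filled with NaN.
result = df.reindex(columns=mux)
print(result)
```

Building stats from the frame itself keeps the mean/max order of the original columns, which sorting the level would not.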

Related

Groupby, compare one column value with another column's maximum value in Pandas

Given a dataframe df as follows:
id building floor_number floor_name
0 1 A 8 5F
1 2 A 4 4F
2 3 A 3 3F
3 4 A 2 2F
4 5 A 1 1F
5 6 B 14 17F
6 7 B 13 16F
7 8 B 20 world
8 9 B 13 hello
9 10 B 13 16F
I need to extract the numeric part of the floor_name column, then group by building and compare each row's floor_number with the maximum extracted floor_name value for that building. If floor_number is greater than that maximum, a new column check should contain invalid floor number.
This is expected result:
id building ... floor_name check
0 1 A ... 5F invalid floor number
1 2 A ... 4F NaN
2 3 A ... 3F NaN
3 4 A ... 2F NaN
4 5 A ... 1F NaN
5 6 B ... 17F NaN
6 7 B ... 16F NaN
7 8 B ... world invalid floor number
8 9 B ... hello NaN
9 10 B ... 16F NaN
To extract the values from floor_name, group by building and get the maximum of floor_name, I have used:
df['floor_name'] = df['floor_name'].str.extract('(\d*)', expand = False)
df.groupby('building')['floor_name'].max()
Out:
building
A 5
B 17
Name: floor_name, dtype: object
How could I finish the rest of the code? Thanks in advance.
Use groupby().transform(). Also, it's better to convert to a numeric type first, since '2' > '17' when the values are compared as strings:
numeric_floors = (
    df['floor_name']
    .str.extract(r'(\d+)', expand=False)  # use \d+ instead of \d*
    .astype(float)                        # convert to numeric type
    .groupby(df['building'])
    .transform('max')
)
df.loc[df['floor_number'] > numeric_floors, 'check'] = 'invalid floor number'
Output:
id building floor_number floor_name check
0 1 A 8 5F invalid floor number
1 2 A 4 4F NaN
2 3 A 3 3F NaN
3 4 A 2 2F NaN
4 5 A 1 1F NaN
5 6 B 14 17F NaN
6 7 B 13 16F NaN
7 8 B 20 world invalid floor number
8 9 B 13 hello NaN
9 10 B 13 16F NaN
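For reference, here is a self-contained sketch that rebuilds the sample frame and applies the same idea step by step. The intermediate names floors and max_floor are illustrative and not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    'id': range(1, 11),
    'building': list('AAAAABBBBB'),
    'floor_number': [8, 4, 3, 2, 1, 14, 13, 20, 13, 13],
    'floor_name': ['5F', '4F', '3F', '2F', '1F',
                   '17F', '16F', 'world', 'hello', '16F'],
})

# First run of digits in floor_name as floats; names without digits become NaN.
floors = df['floor_name'].str.extract(r'(\d+)', expand=False).astype(float)

# Per-building maximum (NaN is skipped), broadcast back to every row.
max_floor = floors.groupby(df['building']).transform('max')

# Flag rows whose floor_number exceeds the maximum of their building.
df.loc[df['floor_number'] > max_floor, 'check'] = 'invalid floor number'
print(df)
```

transform returns a result aligned to the original rows, which is why it can be compared with floor_number directly, whereas max() alone returns one value per building.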

map data from one column to another

I have two DataFrames d1 and d2.
d1:
category value
0 a 4
1 b 9
2 c 14
3 d 19
4 e 24
5 f 29
d2:
one two
0 NaN a
1 NaN a
2 NaN c
3 NaN d
4 NaN e
5 NaN a
I want to map the values from d1 into the 'one' column of d2, using the category column from d1 as the key.
This should give me:
one two
0 4 a
1 4 a
2 14 c
3 19 d
4 24 e
5 4 a
Try:
d2['one'] = d2['two'].map(d1.set_index('category')['value'])
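A runnable sketch of the same lookup, with both frames rebuilt from the question. The name lookup is introduced here only to make the intermediate Series visible.

```python
import numpy as np
import pandas as pd

d1 = pd.DataFrame({'category': list('abcdef'),
                   'value': [4, 9, 14, 19, 24, 29]})
d2 = pd.DataFrame({'one': np.nan,
                   'two': ['a', 'a', 'c', 'd', 'e', 'a']})

# Series indexed by category: the lookup table that map() consults.
lookup = d1.set_index('category')['value']

# Each label in 'two' is replaced by its value; labels missing from
# the lookup would come out as NaN.
d2['one'] = d2['two'].map(lookup)
print(d2)
```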

Concatenating Pandas dataframes, getting Nan values for first dataframe

I'm trying to join two dataframes. 'df' is my initial dataframe containing all the header information I require. 'row' is my first row of data that I want to append to 'df'.
df =
FName E1 E2 E3 E4 E5 E6
0 Nan 2 2 2 2 2 2
1 Nan 1 1 1 1 1 1
2 Nan 3 4 5 6 7 8
3 Nan 4 5 6 7 8 10
4 Nan 1002003004 1002004005 1002005006 1002006007 1002007008 1002008010
row =
0 1 2 3 4 5 6
0 501#_ZMB_2019-04-03_070528_reciprocals 30.0193 30.0193 30.0193 34.8858 34.8858 34.8858
I'm trying to create this:
FName E1 E2 E3 E4 E5 E6
0 Nan 2 2 2 2 2 2
1 Nan 1 1 1 1 1 1
2 Nan 3 4 5 6 7 8
3 Nan 4 5 6 7 8 10
4 Nan 1002003004 1002004005 1002005006 1002006007 1002007008 1002008010
5 501#_ZMB_2019-04-03_070528_reciprocals 30.0193 30.0193 30.0193 34.8858 34.8858 34.8858
I have tried the following:
df = df.append(row, ignore_index=True)
and
df = pd.concat([df, row], ignore_index=True)
Both of these result in the loss of all the data in the first df, which should contain all the header information.
0 1 2 3 4 5 6
0 Nan Nan Nan Nan Nan Nan Nan
1 Nan Nan Nan Nan Nan Nan Nan
2 Nan Nan Nan Nan Nan Nan Nan
3 Nan Nan Nan Nan Nan Nan Nan
4 Nan Nan Nan Nan Nan Nan Nan
5 501#_ZMB_2019-04-03_070528_reciprocals 30.0193 30.0193 30.0193 34.8858 34.8858 34.8858
I've also tried
df = pd.concat([df.reset_index(drop=True, inplace=True), row.reset_index(drop=True, inplace=True)])
Which produced the following Traceback
Traceback (most recent call last):
File "<ipython-input-146-3c1ecbd1987c>", line 1, in <module>
df = pd.concat([df.reset_index(drop=True, inplace=True), row.reset_index(drop=True, inplace=True)])
File "C:\Users\russells\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\reshape\concat.py", line 228, in concat
copy=copy, sort=sort)
File "C:\Users\russells\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\reshape\concat.py", line 280, in __init__
raise ValueError('All objects passed were None')
ValueError: All objects passed were None
Does anyone know what I'm doing wrong?
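When you concatenate extra rows, pandas aligns on the column labels, and the labels of df and row currently do not overlap. (The traceback is a separate issue: reset_index with inplace=True returns None, so concat received a list of None values.) Renaming the columns of row will get the job done:
pd.concat([df, row.rename(columns=dict(zip(row.columns, df.columns)))],
          ignore_index=True)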
FName E1 E2 E3 E4 E5 E6
0 Nan 2 2 2 2 2 2
1 Nan 1 1 1 1 1 1
2 Nan 3 4 5 6 7 8
3 Nan 4 5 6 7 8 10
4 Nan 1002003004 1002004005 1002005006 1002006007 1002007008 1002008010
5 501#_ZMB_2019-04-03_070528_reciprocals 30.0193 30.0193 30.0193 34.8858 34.8858 34.8858
Or if you just need to assign one row at the end and you have a RangeIndex on df:
df.loc[df.shape[0], :] = row.to_numpy()
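Both points can be checked on a reduced example. This is a sketch with shortened stand-in data (three columns instead of seven, hypothetical values), not the asker's actual frames.

```python
import pandas as pd

# Small stand-ins for the question's frames.
df = pd.DataFrame({'FName': ['Nan', 'Nan'], 'E1': [2, 1], 'E2': [2, 1]})
row = pd.DataFrame([['501#_ZMB', 30.0193, 34.8858]])  # column labels 0, 1, 2

# inplace=True modifies row and returns None, which is what
# pd.concat was handed in the failing attempt.
returned = row.reset_index(drop=True, inplace=True)

# Give row the same column labels as df so the rows stack
# under the existing headers instead of forming new columns.
aligned = row.rename(columns=dict(zip(row.columns, df.columns)))
result = pd.concat([df, aligned], ignore_index=True)
print(result)
```

The rename relies on row having its columns in the same positional order as df; if that is not guaranteed, the mapping has to be written out explicitly.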

shifting a column down in a pandas dataframe

I have data like this:
A B C
1 2 3
2 5 6
7 8 9
I want to change the dataframe into
A B C
2 3
1 5 6
2 8 9
3
One way would be to add a blank row to the dataframe and then use shift:
# input df:
A B C
0 1 2 3
1 2 5 6
2 7 8 9
df.loc[len(df.index), :] = None
df['A'] = df.A.shift(1)
print (df)
A B C
0 NaN 2.0 3.0
1 1.0 5.0 6.0
2 2.0 8.0 9.0
3 7.0 NaN NaN
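A variant of the same idea, sketched here as an alternative rather than as the answer's method: reindex can append the blank row without modifying the original frame. The name out is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 7], 'B': [2, 5, 8], 'C': [3, 6, 9]})

# Extend the index by one label; the new row is filled with NaN,
# so the shifted value has somewhere to land.
out = df.reindex(range(len(df) + 1))

# Move column A down by one position; its first entry becomes NaN.
out['A'] = out['A'].shift(1)
print(out)
```

As in the answer above, the integer columns become floats because NaN cannot be stored in a plain integer column.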

Python: Summing every five rows of column b data and create a new column

I have a dataframe like the one below. I would like to sum every five rows (rows 0 to 4, then 5 to 9, and so on) and put each sum in another column ("new column"). My real dataframe has 263 rows, so the final group, like the last three rows of each repeating block below, will be the sum of only three rows. How can I do this using Pandas/Python? I have started to learn Python recently. Thanks in advance for any advice!
My data pattern is more complex, as I am using the index as one of my column values and it repeats like this:
Row Data "new column"
0 5
1 1
2 3
3 3
4 2 14
5 4
6 8
7 1
8 2
9 1 16
10 0
11 2
12 3 5
0 3
1 1
2 2
3 3
4 2 11
5 2
6 6
7 2
8 2
9 1 13
10 1
11 0
12 1 2
...
259 50 89
260 1
261 4
262 5 10
I tried iterrows and groupby but can't make it work so far.
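Use this (it keeps a sum only at the last occurrence of each distinct value, so it assumes that no two groups share the same sum):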
df['new col'] = df.groupby(df.index // 5)['Data'].transform('sum')[lambda x: ~(x.duplicated(keep='last'))]
Output:
Data new col
0 5 NaN
1 1 NaN
2 3 NaN
3 3 NaN
4 2 14.0
5 4 NaN
6 8 NaN
7 1 NaN
8 2 NaN
9 1 16.0
Edit to handle the updated question:
g = df.groupby(df.Row).cumcount()
df['new col'] = df.groupby([g, df.Row // 5])['Data']\
.transform('sum')[lambda x: ~(x.duplicated(keep='last'))]
Output:
Row Data new col
0 0 5 NaN
1 1 1 NaN
2 2 3 NaN
3 3 3 NaN
4 4 2 14.0
5 5 4 NaN
6 6 8 NaN
7 7 1 NaN
8 8 2 NaN
9 9 1 16.0
10 10 0 NaN
11 11 2 NaN
12 12 3 5.0
13 0 3 NaN
14 1 1 NaN
15 2 2 NaN
16 3 3 NaN
17 4 2 11.0
18 5 2 NaN
19 6 6 NaN
20 7 2 NaN
21 8 2 NaN
22 9 1 13.0
23 10 1 NaN
24 11 0 NaN
25 12 1 2.0
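Because duplicated is applied to the summed values, a sum would be dropped whenever two groups happen to have the same total. A sketch of a variant that marks the last row of each group from the group keys instead is shown below; the names block, chunk, keys and is_last are introduced for illustration, and the data is the two-block sample from the question.

```python
import pandas as pd

# Two blocks of Row values 0..12, as in the updated question.
data = [5, 1, 3, 3, 2, 4, 8, 1, 2, 1, 0, 2, 3,
        3, 1, 2, 3, 2, 2, 6, 2, 2, 1, 1, 0, 1]
df = pd.DataFrame({'Row': list(range(13)) * 2, 'Data': data})

# Block number: how many times each Row value has been seen before.
block = df.groupby('Row').cumcount()

# Chunk of five within a block (Row 0-4 -> 0, 5-9 -> 1, 10-12 -> 2).
chunk = df['Row'] // 5

# Sum of Data per (block, chunk), repeated on every row of the group.
sums = df.groupby([block, chunk])['Data'].transform('sum')

# True only on the final row of each (block, chunk) pair. This looks at
# the keys, not the sums, so groups with equal totals are both kept.
keys = pd.DataFrame({'block': block, 'chunk': chunk})
is_last = ~keys.duplicated(keep='last')

df['new col'] = sums.where(is_last)
print(df)
```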
