Apply sklearn.preprocessing.MinMaxScaler function based on specific group/id in Python - python-3.x

I have a dataframe given as such:
#Load the required libraries
import pandas as pd
#Create dataset
data = {'id': ['A', 'A', 'A', 'A', 'A', 'A', 'A',
               'B', 'B', 'B', 'B', 'B', 'B',
               'C', 'C', 'C', 'C', 'C',
               'D', 'D', 'D', 'D',
               'E', 'E', 'E', 'E', 'E', 'E'],
        'cycle': [1, 2, 3, 4, 5, 6, 7,
                  1, 2, 3, 4, 5, 6,
                  1, 2, 3, 4, 5,
                  1, 2, 3, 4,
                  1, 2, 3, 4, 5, 6],
        'Salary': [7, 7, 7, 8, 9, 10, 15,
                   4, 4, 4, 4, 5, 6,
                   8, 9, 10, 12, 13,
                   8, 9, 10, 11,
                   7, 11, 12, 13, 14, 15],
        'Jobs': [123, 18, 69, 65, 120, 11, 52,
                 96, 120, 10, 141, 52, 6,
                 101, 99, 128, 1, 141,
                 141, 123, 12, 66,
                 12, 128, 66, 100, 141, 52],
        'Days': [123, 128, 66, 66, 120, 141, 52,
                 96, 120, 120, 141, 52, 96,
                 15, 123, 128, 120, 141,
                 141, 123, 128, 66,
                 123, 128, 66, 120, 141, 52],
        }
#Convert to dataframe
df = pd.DataFrame(data)
print("df = \n", df)
The print statement above shows the resulting dataframe.
Here, I wish to apply sklearn.preprocessing.MinMaxScaler to the columns 'Salary', 'Jobs' and 'Days' separately for each specific group/id, i.e. scale each of these columns to the [0, 1] range within each id.
Can somebody please let me know how to achieve this task in Python?

You can use groupby to compute the min-max scaling per group:
cols = ['Salary', 'Jobs', 'Days']
minmax_scale = lambda x: (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))
df[cols] = df.groupby('id')[cols].apply(minmax_scale)
Output:
>>> df
id cycle Salary Jobs Days
0 A 1 0.000000 1.000000 0.797753
1 A 2 0.000000 0.062500 0.853933
2 A 3 0.000000 0.517857 0.157303
3 A 4 0.125000 0.482143 0.157303
4 A 5 0.250000 0.973214 0.764045
5 A 6 0.375000 0.000000 1.000000 # Max for Days of Group A
6 A 7 1.000000 0.366071 0.000000 # Min for Days of Group A
7 B 1 0.000000 0.666667 0.494382
8 B 2 0.000000 0.844444 0.764045
9 B 3 0.000000 0.029630 0.764045
10 B 4 0.000000 1.000000 1.000000
11 B 5 0.500000 0.340741 0.000000
12 B 6 1.000000 0.000000 0.494382
13 C 1 0.000000 0.714286 0.000000
14 C 2 0.200000 0.700000 0.857143
15 C 3 0.400000 0.907143 0.896825
16 C 4 0.800000 0.000000 0.833333
17 C 5 1.000000 1.000000 1.000000
18 D 1 0.000000 1.000000 1.000000
19 D 2 0.333333 0.860465 0.760000
20 D 3 0.666667 0.000000 0.826667
21 D 4 1.000000 0.418605 0.000000
22 E 1 0.000000 0.000000 0.797753
23 E 2 0.500000 0.899225 0.853933
24 E 3 0.625000 0.418605 0.157303
25 E 4 0.750000 0.682171 0.764045
26 E 5 0.875000 1.000000 1.000000
27 E 6 1.000000 0.310078 0.000000
As suggested by @mozway, you can use a named function or a lambda with the walrus operator:
# The fastest
def minmax_scale(x):
xmin = x.min(axis=0)
xmax = x.max(axis=0)
return (x - xmin) / (xmax - xmin)
# Average performance, using walrus operator (Python >= 3.8)
minmax_scale = lambda x: (x - (m := x.min(axis=0))) / (x.max(axis=0) - m)
# The slowest
minmax_scale = lambda x: (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))
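If you want to go through sklearn's MinMaxScaler itself rather than the manual formula, a minimal sketch (assuming the same df and cols as above) is to fit a fresh scaler inside each group and keep the group's index so the result aligns back into the frame; the numbers should match the manual min-max scaling:
from sklearn.preprocessing import MinMaxScaler

cols = ['Salary', 'Jobs', 'Days']

def scale_group(g):
    # fit a fresh scaler per group and keep the group's index for alignment
    return pd.DataFrame(MinMaxScaler().fit_transform(g), index=g.index, columns=g.columns)

df[cols] = df.groupby('id', group_keys=False)[cols].apply(scale_group)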

Related

dataframes list concatenation/merging introduces nan values

I have these dataframes:
import pandas as pd
import numpy as np
from functools import reduce
a = pd.DataFrame({'id':[1, 2, 3, 4, 5], 'gr_code': [121, 121, 134, 155, 156],
'A_val': [0.1, np.nan, 0.3, np.nan, 0.5], 'B_val': [1.233, np.nan, 1.4, np.nan, 1.9]})
b = pd.DataFrame({'id':[1, 2, 3, 4, 5], 'gr_code': [121, 121, 134, 155, 156],
'A_val': [np.nan, 0.2, np.nan, 0.4, np.nan], 'B_val': [np.nan, 1.56, np.nan, 1.1, np.nan]})
c = pd.DataFrame({'id':[1, 2, 3, 4, 5], 'gr_code': [121, 121, 134, 155, 156],
'C_val': [121, np.nan, 334, np.nan, 555], 'D_val': [10.233, np.nan, 10.4, np.nan, 10.9]})
d = pd.DataFrame({'id':[1, 2, 3, 4, 5], 'gr_code': [121, 121, 134, 155, 156],
'C_val': [np.nan, 322, np.nan, 454, np.nan], 'D_val': [np.nan, 10.56, np.nan, 10.1, np.nan]})
I am dropping the nan values:
a.dropna(inplace=True)
b.dropna(inplace=True)
c.dropna(inplace=True)
d.dropna(inplace=True)
And then I want to merge them and get this result:
id gr_code A_val B_val C_val D_val
1 121 0.1 1.233 121.0 10.233
2 121 0.2 1.56 322 10.56
3 134 0.3 1.400 334.0 10.400
4 155 0.4 1.10 454.0 10.10
5 156 0.5 1.900 555.0 10.900
but whatever I try, it introduces NaN values.
For example:
df = pd.concat([a, b, c, d], axis=1)
df = df.loc[:,~df.columns.duplicated()]
gives:
id gr_code A_val B_val C_val D_val
1.0 121.0 0.1 1.233 121.0 10.233
3.0 134.0 0.3 1.400 334.0 10.400
5.0 156.0 0.5 1.900 555.0 10.900
NaN NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN NaN
If I try:
df_list = [a, b, c, d]
df = reduce(lambda left, right: pd.merge(left, right,
on=['id', 'gr_code'],
how='outer'), df_list)
it gives:
id gr_code A_val_x B_val_x A_val_y B_val_y C_val_x D_val_x C_val_y D_val_y
1 121 0.1 1.233 NaN NaN 121.0 10.233 NaN NaN
3 134 0.3 1.400 NaN NaN 334.0 10.400 NaN NaN
5 156 0.5 1.900 NaN NaN 555.0 10.900 NaN NaN
2 121 NaN NaN 0.2 1.56 NaN NaN 322.0 10.56
4 155 NaN NaN 0.4 1.10 NaN NaN 454.0 10.10
Here are more dataframes:
e = pd.DataFrame({'id':[1, 2, 3, 4, 5], 'gr_code': [121, 121, 134, 155, 156],
'E_val': [0.11, np.nan, 0.13, np.nan, 0.35], 'F_val': [11.233, np.nan, 11.4, np.nan, 11.9]})
f = pd.DataFrame({'id':[1, 2, 3, 4, 5], 'gr_code': [121, 121, 134, 155, 156],
'E_val': [np.nan, 3222, np.nan, 4541, np.nan], 'F_val': [np.nan, 110.56, np.nan, 101.1, np.nan]})
You can use concat and merge the duplicated columns:
df = (pd.concat([d.set_index(['id', 'gr_code']) for d in df_list], axis=1)
.groupby(level=0, axis=1).first().reset_index()
)
output:
id gr_code A_val B_val C_val D_val
0 1 121 0.1 1.233 121.0 10.233
1 2 121 0.2 1.560 322.0 10.560
2 3 134 0.3 1.400 334.0 10.400
3 4 155 0.4 1.100 454.0 10.100
4 5 156 0.5 1.900 555.0 10.900
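If the extra frames e and f from the question are included as well, one way to sketch the same idea (assuming all six frames are defined as in the question) is to chain combine_first, which fills the NaN holes of the left frame from the right one after aligning on the (id, gr_code) index; with this approach the dropna step is not even required:
from functools import reduce

df_list = [a, b, c, d, e, f]
# set the key columns as index so combine_first aligns rows and columns,
# then let each frame fill the gaps left by the previous ones
df = (reduce(lambda left, right: left.combine_first(right),
             [x.set_index(['id', 'gr_code']) for x in df_list])
        .reset_index())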

python - Iterating over multi-index pandas dataframe

I'm trying to iterate over a huge pandas dataframe (over 370,000 rows) based on the index.
For each row, the code should look back at the last 12 entries for this index (if available) and sum them up into (running) quarters / half-years / a year.
If there is no information, or not enough information (e.g. only the last 3 months), then the code should consider the other months / quarters as 0.
Here is a sample of my dataframe:
This is the expected output:
So looking at DateID "1", we don't have any other information for this row. DateID "1" is the last month in this case (month 12, so to speak) and therefore falls in Q4 and H2. All other previous months do not exist and are therefore not considered.
I already found a working solution, but it's very inefficient and takes an unacceptable amount of time.
Here is my code sample:
for company_name, c in df.groupby('Account Name'):
    for i, row in c.iterrows():
        i += 1
        if i < 4:
            q4 = c.iloc[:i]['Value$'].sum()
            q3 = 0
            q2 = 0
            q1 = 0
            h2 = q4 + q3
            h1 = q2 + q1
            year = q4 + q3 + q2 + q1
        elif 3 < i < 7:
            q4 = c.iloc[i-3:i]['Value$'].sum()
            q3 = c.iloc[:i-3]['Value$'].sum()
            q2 = 0
            q1 = 0
            h2 = q4 + q3
            h1 = q2 + q1
            year = q4 + q3 + q2 + q1
        elif 6 < i < 10:
            q4 = c.iloc[i-3:i]['Value$'].sum()
            q3 = c.iloc[i-6:i-3]['Value$'].sum()
            q2 = c.iloc[:i-6]['Value$'].sum()
            q1 = 0
            h2 = q4 + q3
            h1 = q2 + q1
            year = q4 + q3 + q2 + q1
        elif 9 < i < 13:
            q4 = c.iloc[i-3:i]['Value$'].sum()
            q3 = c.iloc[i-6:i-3]['Value$'].sum()
            q2 = c.iloc[i-9:i-6]['Value$'].sum()
            q1 = c.iloc[:i-9]['Value$'].sum()
            h2 = q4 + q3
            h1 = q2 + q1
            year = q4 + q3 + q2 + q1
        else:
            q4 = c.iloc[i-3:i]['Value$'].sum()
            q3 = c.iloc[i-6:i-3]['Value$'].sum()
            q2 = c.iloc[i-9:i-6]['Value$'].sum()
            q1 = c.iloc[i-12:i-9]['Value$'].sum()
            h2 = q4 + q3
            h1 = q2 + q1
            year = q4 + q3 + q2 + q1
        new_df = new_df.append({'Account Name': row['Account Name'], 'DateID': row['DateID'], 'Q4': q4, 'Q3': q3, 'Q2': q2, 'Q1': q1, 'H1': h1, 'H2': h2, 'Year': year}, ignore_index=True)
As I said, I'm looking for a more efficient way to calculate these numbers, as I have almost 10,000 Account Names and 30 DateIDs per Account.
Thanks a lot!
If I got you right, this should calculate your figures:
grouped= df.groupby('Account Name')['Value$']
last_3= grouped.apply(lambda ser: ser.rolling(window=3, min_periods=1).sum())
last_6= grouped.apply(lambda ser: ser.rolling(window=6, min_periods=1).sum())
last_9= grouped.apply(lambda ser: ser.rolling(window=9, min_periods=1).sum())
last_12= grouped.apply(lambda ser: ser.rolling(window=12, min_periods=1).sum())
df['Q4']= last_3
df['Q3']= last_6 - last_3
df['Q2']= last_9 - last_6
df['Q1']= last_12 - last_9
df['H1']= df['Q1'] + df['Q2']
df['H2']= df['Q3'] + df['Q4']
This outputs:
Out[19]:
Account Name DateID Value$ Q4 Q3 Q2 Q1 H1 H2
0 A 0 33 33.0 0.0 0.0 0.0 0.0 33.0
1 A 1 20 53.0 0.0 0.0 0.0 0.0 53.0
2 A 2 24 77.0 0.0 0.0 0.0 0.0 77.0
3 A 3 21 65.0 33.0 0.0 0.0 0.0 98.0
4 A 4 22 67.0 53.0 0.0 0.0 0.0 120.0
5 A 5 31 74.0 77.0 0.0 0.0 0.0 151.0
6 A 6 30 83.0 65.0 33.0 0.0 33.0 148.0
7 A 7 23 84.0 67.0 53.0 0.0 53.0 151.0
8 A 8 11 64.0 74.0 77.0 0.0 77.0 138.0
9 A 9 35 69.0 83.0 65.0 33.0 98.0 152.0
10 A 10 32 78.0 84.0 67.0 53.0 120.0 162.0
11 A 11 31 98.0 64.0 74.0 77.0 151.0 162.0
12 A 12 32 95.0 69.0 83.0 65.0 148.0 164.0
13 A 13 20 83.0 78.0 84.0 67.0 151.0 161.0
14 A 14 15 67.0 98.0 64.0 74.0 138.0 165.0
15 B 0 44 44.0 0.0 0.0 0.0 0.0 44.0
16 B 1 43 87.0 0.0 0.0 0.0 0.0 87.0
17 B 2 31 118.0 0.0 0.0 0.0 0.0 118.0
18 B 3 10 84.0 44.0 0.0 0.0 0.0 128.0
19 B 4 13 54.0 87.0 0.0 0.0 0.0 141.0
20 B 5 20 43.0 118.0 0.0 0.0 0.0 161.0
21 B 6 28 61.0 84.0 44.0 0.0 44.0 145.0
22 B 7 14 62.0 54.0 87.0 0.0 87.0 116.0
23 B 8 20 62.0 43.0 118.0 0.0 118.0 105.0
24 B 9 41 75.0 61.0 84.0 44.0 128.0 136.0
25 B 10 39 100.0 62.0 54.0 87.0 141.0 162.0
26 B 11 46 126.0 62.0 43.0 118.0 161.0 188.0
27 B 12 26 111.0 75.0 61.0 84.0 145.0 186.0
28 B 13 24 96.0 100.0 62.0 54.0 116.0 196.0
29 B 14 34 84.0 126.0 62.0 43.0 105.0 210.0
32 C 2 12 12.0 0.0 0.0 0.0 0.0 12.0
33 C 3 15 27.0 0.0 0.0 0.0 0.0 27.0
34 C 4 45 72.0 0.0 0.0 0.0 0.0 72.0
35 C 5 22 82.0 12.0 0.0 0.0 0.0 94.0
36 C 6 48 115.0 27.0 0.0 0.0 0.0 142.0
37 C 7 45 115.0 72.0 0.0 0.0 0.0 187.0
38 C 8 11 104.0 82.0 12.0 0.0 12.0 186.0
39 C 9 27 83.0 115.0 27.0 0.0 27.0 198.0
For the following test data:
data= {'Account Name': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C'],
'DateID': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 2, 3, 4, 5, 6, 7, 8, 9],
'Value$': [33, 20, 24, 21, 22, 31, 30, 23, 11, 35, 32, 31, 32, 20, 15, 44, 43, 31, 10, 13, 20, 28, 14, 20, 41, 39, 46, 26, 24, 34, 12, 15, 45, 22, 48, 45, 11, 27]
}
df= pd.DataFrame(data)
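The original loop also produced a running year figure; with the rolling sums above that is simply the 12-month window, i.e. the two half-years added together (a small addition, not part of the original answer):
# Year over the trailing 12 months = Q1 + Q2 + Q3 + Q4 = H1 + H2 (same as last_12)
df['Year'] = df['H1'] + df['H2']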
Edit: If you want to count the unique entries over the same period, you can do that as follows:
def get_nunique(np_array):
    unique, counts = np.unique(np_array, return_counts=True)
    return len(unique)

df['Category'].rolling(window=3, min_periods=1).apply(get_nunique)
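Note that this rolling count runs over the whole Category column at once; to keep the count per Account Name, consistent with the rest of the answer, a sketch would be to roll inside the groupby and drop the group level so the result aligns with the original index (the variable name is just illustrative):
per_account_unique_3m = (
    df.groupby('Account Name')['Category']
      .rolling(window=3, min_periods=1)
      .apply(get_nunique)
      .droplevel(0)   # drop the 'Account Name' level so the index matches df
)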
I didn't want to overload the answer above completely, so I'm adding a new one for your second part:
# define a function that
# creates the unique counts
# by aggregating period_length times,
# so 3 times for the quarter mapping
# and 6 times for the half year;
# it's basically doing something like
# a sliding window aggregation
def get_mapping(df, period_lenght=3):
    df_mapping = None
    for offset in range(period_lenght):
        quarter = (df['DateID'] + offset) // period_lenght
        aggregated = df.groupby([quarter, df['Account Name']]).agg({'DateID': max, 'Category': lambda ser: len(set(ser))})
        incomplete_data = (((aggregated['DateID'] + offset + 1) // period_lenght <= aggregated.index.get_level_values(0))
                           & (aggregated.index.get_level_values(0) >= period_lenght))
        aggregated.drop(aggregated.index[incomplete_data].to_list(), inplace=True)
        aggregated.set_index('DateID', append=True, inplace=True)
        aggregated = aggregated.droplevel(0, axis='index')
        if df_mapping is None:
            df_mapping = aggregated
        else:
            df_mapping = pd.concat([df_mapping, aggregated], axis='index')
    return df_mapping
# apply it for 3 months and merge it to the source df
df_mapping= get_mapping(df, period_lenght=3)
df_mapping.columns= ['unique_3_months']
df_with_3_months= df.merge(df_mapping, left_on=['Account Name', 'DateID'], how='left', right_index=True)
# do the same for 6 months and merge it again
df_mapping= get_mapping(df, period_lenght=6)
df_mapping.columns= ['unique_6_months']
df_with_6_months= df_with_3_months.merge(df_mapping, left_on=['Account Name', 'DateID'], how='left', right_index=True)
This results in:
Out[305]:
Account Name DateID Value$ Category unique_3_months unique_6_months
0 A 0 10 1 1 1
1 A 1 12 2 2 2
2 A 1 38 1 2 2
3 A 2 20 3 3 3
4 A 3 25 3 3 3
5 A 4 24 4 2 4
6 A 5 27 8 3 5
7 A 6 30 5 3 6
8 A 7 47 7 3 5
9 A 8 30 4 3 5
10 A 9 17 7 2 4
11 A 10 20 8 3 4
12 A 11 33 8 2 4
13 A 12 45 9 2 4
14 A 13 19 2 3 5
15 A 14 24 10 3 3
15 A 14 24 10 3 4
15 A 14 24 10 3 4
15 A 14 24 10 3 5
15 A 14 24 10 3 1
15 A 14 24 10 3 2
16 B 0 41 2 1 1
17 B 1 13 9 2 2
18 B 2 17 6 3 3
19 B 3 45 7 3 4
20 B 4 11 6 2 4
21 B 5 38 8 3 5
22 B 6 44 8 2 4
23 B 7 15 8 1 3
24 B 8 50 2 2 4
25 B 9 27 7 3 4
26 B 10 38 10 3 4
27 B 11 25 6 3 5
28 B 12 25 8 3 5
29 B 13 14 7 3 5
30 B 14 25 9 3 3
30 B 14 25 9 3 4
30 B 14 25 9 3 5
30 B 14 25 9 3 5
30 B 14 25 9 3 1
30 B 14 25 9 3 2
31 C 2 31 9 1 1
32 C 3 31 7 2 2
33 C 4 26 5 3 3
34 C 5 11 2 3 4
35 C 6 15 8 3 5
36 C 7 22 2 2 5
37 C 8 33 2 2 4
38 C 9 16 5 2 3
38 C 9 16 5 2 3
38 C 9 16 5 2 3
38 C 9 16 5 2 1
38 C 9 16 5 2 2
38 C 9 16 5 2 2
The output is based on the following input data:
data= {
'Account Name': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C'],
'DateID': [0, 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 2, 3, 4, 5, 6, 7, 8, 9],
'Value$': [10, 12, 38, 20, 25, 24, 27, 30, 47, 30, 17, 20, 33, 45, 19, 24, 41, 13, 17, 45, 11, 38, 44, 15, 50, 27, 38, 25, 25, 14, 25, 31, 31, 26, 11, 15, 22, 33, 16],
'Category': [1, 2, 1, 3, 3, 4, 8, 5, 7, 4, 7, 8, 8, 9, 2, 10, 2, 9, 6, 7, 6, 8, 8, 8, 2, 7, 10, 6, 8, 7, 9, 9, 7, 5, 2, 8, 2, 2, 5]
}
df= pd.DataFrame(data)
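If a full-year unique count is needed as well, the same pattern presumably extends by calling the helper once more with a 12-month period and merging it in (the column name unique_12_months is just illustrative):
# same sliding-window aggregation, now over a 12-month period
df_mapping= get_mapping(df, period_lenght=12)
df_mapping.columns= ['unique_12_months']
df_with_12_months= df_with_6_months.merge(df_mapping, left_on=['Account Name', 'DateID'], how='left', right_index=True)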

Data partition on known and unknown rows

I have a dataset with known and unknown values (the missing values occur in just one column). I'd like to separate the rows into 2 lists: a first list of rows with all values known and a second list of rows with missing (unknown) values.
df = {'Id' : [1, 2, 3, 4, 5],
'First' : [30, 22, 18, 49, 22],
'Second' : [80, 28, 16, 56, 30],
'Third' : [14, None, None, 30, 27],
'Fourth' : [14, 85, 17, 22, 14],
'Fifth' : [22, 33, 45, 72, 11]}
df = pd.DataFrame(df, columns = ['Id', 'First', 'Second', 'Third', 'Fourth'])
df
The goal: two separate lists, one with all known values and another one with unknown (missing) values.
Let me know if this helps:
df['TF']= df.isnull().any(axis=1)
df_without_none = df[df['TF'] == 0]
df_with_none = df[df['TF'] == 1]
print(df_without_none.head())
print(df_with_none.head())
#### Input ####
Id First Second Third Fourth Fruit Total TF
0 1 30 80 14.0 14 124.0 False
1 2 22 28 NaN 85 50.0 True
2 3 18 16 NaN 17 34.0 True
3 4 49 56 30.0 22 135.0 False
4 5 22 30 27.0 14 79.0 False
#### Output ####
Id First Second Third Fourth Fruit Total TF
0 1 30 80 14.0 14 124.0 False
3 4 49 56 30.0 22 135.0 False
4 5 22 30 27.0 14 79.0 False
Id First Second Third Fourth Fruit Total TF
1 2 22 28 NaN 85 50.0 True
2 3 18 16 NaN 17 34.0 True
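A slightly leaner sketch of the same idea, if the helper TF column is not needed, is to build the boolean mask once and index with it and its negation:
# True for rows that contain at least one missing value
mask = df.isnull().any(axis=1)
df_with_none = df[mask]
df_without_none = df[~mask]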

Add columns with normalised rankings to a pandas dataframe

I would like to add a column with normalized rankings to a pandas dataframe. The process is as follows:
Import the pandas package first.
#import packages
import pandas as pd
Define a pandas dataframe.
# Create dataframe
data = {'name': ['Jason', 'Jason', 'Tina', 'Tina', 'Tina'],
'reports': [4, 24, 31, 2, 3],
'coverage': [25, 94, 57, 62, 70]}
df = pd.DataFrame(data)
After the dataframe is created, I want to add an extra column to the dataframe. This column contains the rank based on the values in the coverage column for every name separately.
df['coverageRank'] = df.groupby('name')['coverage'].rank()
print (df)
coverage name reports coverageRank
0 25 Jason 4 1.0
1 94 Jason 24 2.0
2 57 Tina 31 1.0
3 62 Tina 2 2.0
4 70 Tina 3 3.0
I now want to normalize the values in the ranking column.
The desired output is
coverage name reports coverageRank
0 25 Jason 4 0.500000
1 94 Jason 24 1.000000
2 57 Tina 31 0.333333
3 62 Tina 2 0.666667
4 70 Tina 3 1.000000
Does someone know a way to do this without using an explicit for-loop?
You can use transform to get a Series with the same size as the original df and then divide by it with div:
a = df.groupby('name')['coverage'].transform('size')
print (a)
0 2
1 2
2 3
3 3
4 3
Name: coverage, dtype: int64
df['coverageRank'] = df.groupby('name')['coverage'].rank().div(a)
print (df)
coverage name reports coverageRank
0 25 Jason 4 0.500000
1 94 Jason 24 1.000000
2 57 Tina 31 0.333333
3 62 Tina 2 0.666667
4 70 Tina 3 1.000000
Another solution with apply:
df['coverageRank'] = df.groupby('name')['coverage'].apply(lambda x: x.rank() / len(x))
print (df)
coverage name reports coverageRank
0 25 Jason 4 0.500000
1 94 Jason 24 1.000000
2 57 Tina 31 0.333333
3 62 Tina 2 0.666667
4 70 Tina 3 1.000000
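A more compact alternative is rank(pct=True), which returns each rank as a fraction of its group size and should give the same normalised values directly:
df['coverageRank'] = df.groupby('name')['coverage'].rank(pct=True)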

How to perform conditional dataframe operations?

Given df
A = pd.DataFrame([[1, 5, 2, 1, 2], [2, 4, 4, 1, 2], [3, 3, 1, 1, 2], [4, 2, 2, 3, 0],
[5, 1, 4, 3, -4], [1, 5, 2, 3, -20], [2, 4, 4, 2, 0], [3, 3, 1, 2, -1],
[4, 2, 2, 2, 0], [5, 1, 4, 2, -2]],
columns=['a', 'b', 'c', 'd', 'e'],
index=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
How can I create a column 'f' that corresponds to the last value in column 'e' before a change in value in column 'd', and holds that value until the next change in value in column 'd'? The output would be:
a b c d e f
1 1 5 2 1 2 nan
2 2 4 4 1 2 nan
3 3 3 1 1 2 nan
4 4 2 2 3 0 2
5 5 1 4 3 -4 2
6 1 5 2 3 -20 2
7 2 4 4 2 0 -20
8 3 3 1 2 -1 -20
9 4 2 2 2 0 -20
10 5 1 4 2 -2 -20
Edit: @Noobie presented a solution that, when applied to real data, breaks down when column 'd' contains a value smaller than the previous one.
I think we should offer better native support for dealing with contiguous groups, but until then you can use the compare-cumsum-groupby pattern:
g = (A["d"] != A["d"].shift()).cumsum()
A["f"] = A["e"].groupby(g).last().shift().loc[g].values
which gives me
In [41]: A
Out[41]:
a b c d e f
1 1 5 2 1 2 NaN
2 2 4 4 1 2 NaN
3 3 3 1 1 2 NaN
4 4 2 2 2 0 2.0
5 5 1 4 2 -4 2.0
6 1 5 2 2 -20 2.0
7 2 4 4 3 0 -20.0
8 3 3 1 3 -1 -20.0
9 4 2 2 3 0 -20.0
10 5 1 4 3 -2 -20.0
This works because g is a count corresponding to each contiguous group of d values. Note that in this case, using the example you posted, g is the same as column "d", but that needn't be the case. Once we have g, we can use it to group column e:
In [55]: A["e"].groupby(g).last()
Out[55]:
d
1 2
2 -20
3 -2
Name: e, dtype: int64
and then
In [57]: A["e"].groupby(g).last().shift()
Out[57]:
d
1 NaN
2 2.0
3 -20.0
Name: e, dtype: float64
In [58]: A["e"].groupby(g).last().shift().loc[g]
Out[58]:
d
1 NaN
1 NaN
1 NaN
2 2.0
2 2.0
2 2.0
3 -20.0
3 -20.0
3 -20.0
3 -20.0
Name: e, dtype: float64
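For reference, the same column can also be sketched without the .loc[g] lookup, by marking the rows where the group changes on a shifted 'e' and forward-filling (using the same g as above):
g = (A["d"] != A["d"].shift()).cumsum()
# at each group boundary, the shifted e holds the last e of the previous group
A["f"] = A["e"].shift().where(g.ne(g.shift())).ffill()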
easy my friend. unleash the POWER OF PANDAS !
A.sort_values(by = 'd', inplace = True)
A['lag'] = A.e.shift(1)
A['output'] = A.groupby('d').lag.transform(lambda x : x.iloc[0])
A
Out[57]:
a b c d e lag output
1 1 5 2 1 2 NaN NaN
2 2 4 4 1 2 2.0 NaN
3 3 3 1 1 2 2.0 NaN
4 4 2 2 2 0 2.0 2.0
5 5 1 4 2 -4 0.0 2.0
6 1 5 2 2 -20 -4.0 2.0
7 2 4 4 3 0 -20.0 -20.0
8 3 3 1 3 -1 0.0 -20.0
9 4 2 2 3 0 -1.0 -20.0
10 5 1 4 3 -2 0.0 -20.0
