Iterating over multi-index pandas dataframe (python-3.x)

I'm trying to iterate over a huge pandas dataframe (over 370,000 rows) based on the index.
For each row, the code should look back at the last 12 entries of this index (if available) and sum them up by (running) quarters / semesters / year.
If there is no information, or not enough information (e.g. only the last 3 months), then the code should treat the missing months / quarters as 0.
Here is a sample of my dataframe:
This is the expected output:
So looking at DateID "1", we don't have any other information for this row. DateID "1" is the last month in this case (month 12, so to speak) and therefore falls in Q4 and H2. All other previous months do not exist and are therefore not considered.
I already found a working solution, but it is very inefficient and takes an unacceptable amount of time.
Here is my code sample:
for company_name, c in df.groupby('Account Name'):
    for i, row in c.iterrows():
        i += 1
        if i < 4:
            q4 = c.iloc[:i]['Value$'].sum()
            q3 = 0
            q2 = 0
            q1 = 0
            h2 = q4 + q3
            h1 = q2 + q1
            year = q4 + q3 + q2 + q1
        elif 3 < i < 7:
            q4 = c.iloc[i-3:i]['Value$'].sum()
            q3 = c.iloc[:i-3]['Value$'].sum()
            q2 = 0
            q1 = 0
            h2 = q4 + q3
            h1 = q2 + q1
            year = q4 + q3 + q2 + q1
        elif 6 < i < 10:
            q4 = c.iloc[i-3:i]['Value$'].sum()
            q3 = c.iloc[i-6:i-3]['Value$'].sum()
            q2 = c.iloc[:i-6]['Value$'].sum()
            q1 = 0
            h2 = q4 + q3
            h1 = q2 + q1
            year = q4 + q3 + q2 + q1
        elif 9 < i < 13:
            q4 = c.iloc[i-3:i]['Value$'].sum()
            q3 = c.iloc[i-6:i-3]['Value$'].sum()
            q2 = c.iloc[i-9:i-6]['Value$'].sum()
            q1 = c.iloc[:i-9]['Value$'].sum()
            h2 = q4 + q3
            h1 = q2 + q1
            year = q4 + q3 + q2 + q1
        else:
            q4 = c.iloc[i-3:i]['Value$'].sum()
            q3 = c.iloc[i-6:i-3]['Value$'].sum()
            q2 = c.iloc[i-9:i-6]['Value$'].sum()
            q1 = c.iloc[i-12:i-9]['Value$'].sum()
            h2 = q4 + q3
            h1 = q2 + q1
            year = q4 + q3 + q2 + q1
        new_df = new_df.append({'Account Name': row['Account Name'], 'DateID': row['DateID'], 'Q4': q4, 'Q3': q3, 'Q2': q2, 'Q1': q1, 'H1': h1, 'H2': h2, 'Year': year}, ignore_index=True)
As I said, I'm looking for a more efficient way to calculate these numbers, as I have almost 10,000 Account Names and 30 DateIDs per Account.
Thanks a lot!

If I got you right, this should calculate your figures:
grouped = df.groupby('Account Name')['Value$']
last_3 = grouped.apply(lambda ser: ser.rolling(window=3, min_periods=1).sum())
last_6 = grouped.apply(lambda ser: ser.rolling(window=6, min_periods=1).sum())
last_9 = grouped.apply(lambda ser: ser.rolling(window=9, min_periods=1).sum())
last_12 = grouped.apply(lambda ser: ser.rolling(window=12, min_periods=1).sum())
df['Q4'] = last_3
df['Q3'] = last_6 - last_3
df['Q2'] = last_9 - last_6
df['Q1'] = last_12 - last_9
df['H1'] = df['Q1'] + df['Q2']
df['H2'] = df['Q3'] + df['Q4']
This outputs:
Out[19]:
Account Name DateID Value$ Q4 Q3 Q2 Q1 H1 H2
0 A 0 33 33.0 0.0 0.0 0.0 0.0 33.0
1 A 1 20 53.0 0.0 0.0 0.0 0.0 53.0
2 A 2 24 77.0 0.0 0.0 0.0 0.0 77.0
3 A 3 21 65.0 33.0 0.0 0.0 0.0 98.0
4 A 4 22 67.0 53.0 0.0 0.0 0.0 120.0
5 A 5 31 74.0 77.0 0.0 0.0 0.0 151.0
6 A 6 30 83.0 65.0 33.0 0.0 33.0 148.0
7 A 7 23 84.0 67.0 53.0 0.0 53.0 151.0
8 A 8 11 64.0 74.0 77.0 0.0 77.0 138.0
9 A 9 35 69.0 83.0 65.0 33.0 98.0 152.0
10 A 10 32 78.0 84.0 67.0 53.0 120.0 162.0
11 A 11 31 98.0 64.0 74.0 77.0 151.0 162.0
12 A 12 32 95.0 69.0 83.0 65.0 148.0 164.0
13 A 13 20 83.0 78.0 84.0 67.0 151.0 161.0
14 A 14 15 67.0 98.0 64.0 74.0 138.0 165.0
15 B 0 44 44.0 0.0 0.0 0.0 0.0 44.0
16 B 1 43 87.0 0.0 0.0 0.0 0.0 87.0
17 B 2 31 118.0 0.0 0.0 0.0 0.0 118.0
18 B 3 10 84.0 44.0 0.0 0.0 0.0 128.0
19 B 4 13 54.0 87.0 0.0 0.0 0.0 141.0
20 B 5 20 43.0 118.0 0.0 0.0 0.0 161.0
21 B 6 28 61.0 84.0 44.0 0.0 44.0 145.0
22 B 7 14 62.0 54.0 87.0 0.0 87.0 116.0
23 B 8 20 62.0 43.0 118.0 0.0 118.0 105.0
24 B 9 41 75.0 61.0 84.0 44.0 128.0 136.0
25 B 10 39 100.0 62.0 54.0 87.0 141.0 162.0
26 B 11 46 126.0 62.0 43.0 118.0 161.0 188.0
27 B 12 26 111.0 75.0 61.0 84.0 145.0 186.0
28 B 13 24 96.0 100.0 62.0 54.0 116.0 196.0
29 B 14 34 84.0 126.0 62.0 43.0 105.0 210.0
32 C 2 12 12.0 0.0 0.0 0.0 0.0 12.0
33 C 3 15 27.0 0.0 0.0 0.0 0.0 27.0
34 C 4 45 72.0 0.0 0.0 0.0 0.0 72.0
35 C 5 22 82.0 12.0 0.0 0.0 0.0 94.0
36 C 6 48 115.0 27.0 0.0 0.0 0.0 142.0
37 C 7 45 115.0 72.0 0.0 0.0 0.0 187.0
38 C 8 11 104.0 82.0 12.0 0.0 12.0 186.0
39 C 9 27 83.0 115.0 27.0 0.0 27.0 198.0
For the following test data:
data = {'Account Name': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C'],
        'DateID': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 2, 3, 4, 5, 6, 7, 8, 9],
        'Value$': [33, 20, 24, 21, 22, 31, 30, 23, 11, 35, 32, 31, 32, 20, 15, 44, 43, 31, 10, 13, 20, 28, 14, 20, 41, 39, 46, 26, 24, 34, 12, 15, 45, 22, 48, 45, 11, 27]
        }
df = pd.DataFrame(data)
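The question also asks for a yearly total; under this approach it is simply the sum of both halves (equivalently, the 12-month rolling sum), so a small addition assuming the columns computed above:
df['Year'] = df['H1'] + df['H2']   # identical to last_12, the rolling 12-month sum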
Edit: If you want to count the unique entries over the same period, you can do that as follows:
import numpy as np

def get_nunique(np_array):
    # only the number of distinct values is needed
    unique = np.unique(np_array)
    return len(unique)

df['Category'].rolling(window=3, min_periods=1).apply(get_nunique)
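Note that, as written, this rolling window runs over the whole Category column across accounts; if the unique count should restart for each account (an assumption about the requirement), the same grouped pattern as above can be reused:
df.groupby('Account Name')['Category'].apply(
    lambda ser: ser.rolling(window=3, min_periods=1).apply(get_nunique))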

I didn't want to overload the answer above completely, so I add a new one for your second part:
# define a function that
# creates the unique counts
# by aggregating period_length times
# so 3 times for the quarter mapping
# and 6 times for the half year
# it's basically doing something like
# a sliding window aggregation
def get_mapping(df, period_length=3):
    df_mapping = None
    for offset in range(period_length):
        quarter = (df['DateID'] + offset) // period_length
        aggregated = df.groupby([quarter, df['Account Name']]).agg(
            {'DateID': max, 'Category': lambda ser: len(set(ser))})
        incomplete_data = (((aggregated['DateID'] + offset + 1) // period_length
                            <= aggregated.index.get_level_values(0))
                           & (aggregated.index.get_level_values(0) >= period_length))
        aggregated.drop(aggregated.index[incomplete_data].to_list(), inplace=True)
        aggregated.set_index('DateID', append=True, inplace=True)
        aggregated = aggregated.droplevel(0, axis='index')
        if df_mapping is None:
            df_mapping = aggregated
        else:
            df_mapping = pd.concat([df_mapping, aggregated], axis='index')
    return df_mapping
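# Illustration (not part of the original answer): shifting DateID by the offset
# slides the window boundary, so over the period_length passes every DateID
# appears exactly once as the closing month of a window.
for offset in range(3):
    closing = [d for d in range(9) if (d + offset + 1) % 3 == 0]
    print(offset, closing)   # 0 -> [2, 5, 8], 1 -> [1, 4, 7], 2 -> [0, 3, 6]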
# apply it for 3 months and merge it to the source df
df_mapping = get_mapping(df, period_length=3)
df_mapping.columns = ['unique_3_months']
df_with_3_months = df.merge(df_mapping, left_on=['Account Name', 'DateID'], how='left', right_index=True)
# do the same for 6 months and merge it again
df_mapping = get_mapping(df, period_length=6)
df_mapping.columns = ['unique_6_months']
df_with_6_months = df_with_3_months.merge(df_mapping, left_on=['Account Name', 'DateID'], how='left', right_index=True)
This results in:
Out[305]:
Account Name DateID Value$ Category unique_3_months unique_6_months
0 A 0 10 1 1 1
1 A 1 12 2 2 2
2 A 1 38 1 2 2
3 A 2 20 3 3 3
4 A 3 25 3 3 3
5 A 4 24 4 2 4
6 A 5 27 8 3 5
7 A 6 30 5 3 6
8 A 7 47 7 3 5
9 A 8 30 4 3 5
10 A 9 17 7 2 4
11 A 10 20 8 3 4
12 A 11 33 8 2 4
13 A 12 45 9 2 4
14 A 13 19 2 3 5
15 A 14 24 10 3 3
15 A 14 24 10 3 4
15 A 14 24 10 3 4
15 A 14 24 10 3 5
15 A 14 24 10 3 1
15 A 14 24 10 3 2
16 B 0 41 2 1 1
17 B 1 13 9 2 2
18 B 2 17 6 3 3
19 B 3 45 7 3 4
20 B 4 11 6 2 4
21 B 5 38 8 3 5
22 B 6 44 8 2 4
23 B 7 15 8 1 3
24 B 8 50 2 2 4
25 B 9 27 7 3 4
26 B 10 38 10 3 4
27 B 11 25 6 3 5
28 B 12 25 8 3 5
29 B 13 14 7 3 5
30 B 14 25 9 3 3
30 B 14 25 9 3 4
30 B 14 25 9 3 5
30 B 14 25 9 3 5
30 B 14 25 9 3 1
30 B 14 25 9 3 2
31 C 2 31 9 1 1
32 C 3 31 7 2 2
33 C 4 26 5 3 3
34 C 5 11 2 3 4
35 C 6 15 8 3 5
36 C 7 22 2 2 5
37 C 8 33 2 2 4
38 C 9 16 5 2 3
38 C 9 16 5 2 3
38 C 9 16 5 2 3
38 C 9 16 5 2 1
38 C 9 16 5 2 2
38 C 9 16 5 2 2
The output is based on the following input data:
data = {
    'Account Name': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C'],
    'DateID': [0, 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 2, 3, 4, 5, 6, 7, 8, 9],
    'Value$': [10, 12, 38, 20, 25, 24, 27, 30, 47, 30, 17, 20, 33, 45, 19, 24, 41, 13, 17, 45, 11, 38, 44, 15, 50, 27, 38, 25, 25, 14, 25, 31, 31, 26, 11, 15, 22, 33, 16],
    'Category': [1, 2, 1, 3, 3, 4, 8, 5, 7, 4, 7, 8, 8, 9, 2, 10, 2, 9, 6, 7, 6, 8, 8, 8, 2, 7, 10, 6, 8, 7, 9, 9, 7, 5, 2, 8, 2, 2, 5]
}
df = pd.DataFrame(data)

Related

Set upperbound in a column for a specific group by using Python

I have a dataset given as such in Python:
#Load the required libraries
import pandas as pd
#Create dataset
data = {'ID': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3],
        'Salary': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 8],
        'Children': ['No', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'No'],
        'Days': [123, 128, 66, 120, 141, 123, 128, 66, 120, 141, 52, 96, 120, 141, 52, 96, 120, 141, 123, 15, 85, 36, 58, 89],
        }
#Convert to dataframe
df = pd.DataFrame(data)
print("df = \n", df)
The above dataframe looks as follows:
Now, for every ID/group, I wish to set an upper bound for some of the 'Salary' values.
For example,
For ID=1, the upper bound of 'Salary' should be set at 4
For ID=2, the upper bound of 'Salary' should be set at 3
For ID=3, the upper bound of 'Salary' should be set at 5
The net result needs to look as follows:
Can somebody please let me know how to achieve this task in Python?
Use custom function with mapping by helper dictionary in GroupBy.transform:
d = {1: 4, 2: 3, 3: 5}

def f(x):
    x.iloc[:d[x.name]] = d[x.name]
    return x

df['Salary'] = df.groupby('ID')['Salary'].transform(f)
print(df)
ID Salary Children Days
0 1 4 No 123
1 1 4 Yes 128
2 1 4 Yes 66
3 1 4 Yes 120
4 1 5 No 141
5 1 6 No 123
6 1 7 Yes 128
7 1 8 Yes 66
8 1 9 Yes 120
9 1 10 No 141
10 2 3 Yes 52
11 2 3 Yes 96
12 2 3 No 120
13 2 4 Yes 141
14 2 5 Yes 52
15 2 6 Yes 96
16 3 5 Yes 120
17 3 5 Yes 141
18 3 5 No 123
19 3 5 Yes 15
20 3 5 No 85
21 3 6 Yes 36
22 3 7 Yes 58
23 3 8 No 89
Another idea is to use GroupBy.cumcount as a counter per ID, compare it with the mapped ID, and where it matches set the mapped Series with Series.mask:
d = {1:4, 2:3, 3:5}
s = df['ID'].map(d)
df['Salary'] = df['Salary'].mask(df.groupby('ID').cumcount().lt(s), s)
Or, if the counter column is already in Salary, it is possible to use:
s = df['ID'].map(d)
df['Salary'] = df['Salary'].mask(df['Salary'].le(s), s)
print (df)
ID Salary Children Days
0 1 4 No 123
1 1 4 Yes 128
2 1 4 Yes 66
3 1 4 Yes 120
4 1 5 No 141
5 1 6 No 123
6 1 7 Yes 128
7 1 8 Yes 66
8 1 9 Yes 120
9 1 10 No 141
10 2 3 Yes 52
11 2 3 Yes 96
12 2 3 No 120
13 2 4 Yes 141
14 2 5 Yes 52
15 2 6 Yes 96
16 3 5 Yes 120
17 3 5 Yes 141
18 3 5 No 123
19 3 5 Yes 15
20 3 5 No 85
21 3 6 Yes 36
22 3 7 Yes 58
23 3 8 No 89
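When the Salary column is itself the counter, as in the sample data, the mask above is equivalent to a simple clip with a per-row lower bound (a sketch under that assumption):
s = df['ID'].map(d)
df['Salary'] = df['Salary'].clip(lower=s)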
One option is to create a series from the dictionary, merge with the dataframe and then update the Salary column conditionally:
import numpy as np

ser = pd.Series(d, name='d')
ser.index.name = 'ID'
(df
 .merge(ser, on='ID')
 .assign(Salary=lambda f: np.where(f.Salary.lt(f.d), f.d, f.Salary))
 .drop(columns='d')
)
ID Salary Children Days
0 1 4 No 123
1 1 4 Yes 128
2 1 4 Yes 66
3 1 4 Yes 120
4 1 5 No 141
5 1 6 No 123
6 1 7 Yes 128
7 1 8 Yes 66
8 1 9 Yes 120
9 1 10 No 141
10 2 3 Yes 52
11 2 3 Yes 96
12 2 3 No 120
13 2 4 Yes 141
14 2 5 Yes 52
15 2 6 Yes 96
16 3 5 Yes 120
17 3 5 Yes 141
18 3 5 No 123
19 3 5 Yes 15
20 3 5 No 85
21 3 6 Yes 36
22 3 7 Yes 58
23 3 8 No 89

column multiplication based on a mapping

I have the following two dataframes. The first one maps some nodes to an area number and the maximum electric load of that node.
bus = pd.DataFrame(data={'Node':[101, 102, 103, 104, 105], 'Area':[1, 1, 2, 2, 3], 'Load':[10, 15, 12, 20, 25]})
which gives us:
Node Area Load
0 101 1 10
1 102 1 15
2 103 2 12
3 104 2 20
4 105 3 25
The second dataframe shows the total electric load of each area over a time period (from hour 0 to 5). The column names are the areas (matching the column Area in dataframe bus).
load = pd.DataFrame(data={1:[20, 18, 17, 19, 22, 25], 2:[23, 25,24, 27, 30, 32], 3:[10, 14, 19, 25, 22, 20]})
which gives us:
1 2 3
0 20 23 10
1 18 25 14
2 17 24 19
3 19 27 25
4 22 30 22
5 25 32 20
I would like to have a dataframe that shows the electric load of each bus over the 6 hours.
Assumption: the share of the load over time is the same as the share of the maximum load shown in bus; e.g., bus 101 carries 10/(10+15) = 0.4 (40%) of the electric load of area 1, so to calculate its hourly load, 10/(10+15) should be multiplied by the column corresponding to area 1 in load.
The desired output should be of the following format:
101 102 103 104 105
0 8 12 8.625 14.375 10
1 7.2 10.8 9.375 15.625 14
2 6.8 10.2 9 15 19
3 7.6 11.4 10.125 16.875 25
4 8.8 13.2 11.25 18.75 22
5 10 15 12 20 20
For column 101, we have 0.4 multiplied by column 1 of load.
Any help is greatly appreciated.
One option is to divide Load by the per-area sum, then pivot, align the indexes of load and bus, and finally multiply on the matching levels:
(bus.assign(Load = bus.Load.div(bus.groupby('Area').Load.transform('sum')))
    .pivot(None, ['Area', 'Node'], 'Load')
    .reindex(load.index)
    .ffill()   # get the data spread into all rows
    .bfill()
    .mul(load, level=0)
    .droplevel(0, 1)
    .rename_axis(columns=None)
)
101 102 103 104 105
0 8.0 12.0 8.625 14.375 10.0
1 7.2 10.8 9.375 15.625 14.0
2 6.8 10.2 9.000 15.000 19.0
3 7.6 11.4 10.125 16.875 25.0
4 8.8 13.2 11.250 18.750 22.0
5 10.0 15.0 12.000 20.000 20.0
You can calculate the ratio in bus, transpose load, merge the two, and multiply the ratio by the load. Here goes:
bus['area_sum'] = bus.groupby('Area')['Load'].transform('sum')
bus['node_ratio'] = bus['Load'] / bus['area_sum']
full_data = bus.merge(load.T.reset_index(), left_on='Area', right_on='index')
result = pd.DataFrame([full_data['node_ratio'] * full_data[x] for x in range(6)])
result.columns = full_data['Node'].values
result:
    101   102     103     104  105
0     8    12   8.625  14.375   10
1   7.2  10.8   9.375  15.625   14
2   6.8  10.2       9      15   19
3   7.6  11.4  10.125  16.875   25
4   8.8  13.2   11.25   18.75   22
5    10    15      12      20   20
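For reference, a dictionary-comprehension variant (a sketch assuming the same bus and load frames) builds the result column by column from each node's share of its area:
# share of each node within its area
share = bus['Load'] / bus.groupby('Area')['Load'].transform('sum')
# each node's column is the area's hourly load scaled by the node's share
result = pd.DataFrame({node: load[area] * s
                       for node, area, s in zip(bus['Node'], bus['Area'], share)})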

Data partition on known and unknown rows

I have a dataset with known and unknown values (missing values in just one column). I'd like to separate the rows into two lists: the first with rows where all values are known, and the second with rows that have missing (unknown) values.
df = {'Id': [1, 2, 3, 4, 5],
      'First': [30, 22, 18, 49, 22],
      'Second': [80, 28, 16, 56, 30],
      'Third': [14, None, None, 30, 27],
      'Fourth': [14, 85, 17, 22, 14],
      'Fifth': [22, 33, 45, 72, 11]}
df = pd.DataFrame(df, columns=['Id', 'First', 'Second', 'Third', 'Fourth'])
df
I want two separate lists: one with the rows where all values are known, and another with the rows containing unknown values.
Let me know if this helps :
df['TF']= df.isnull().any(axis=1)
df_without_none = df[df['TF'] == 0]
df_with_none = df[df['TF'] == 1]
print(df_without_none.head())
print(df_with_none.head())
#### Input ####
Id First Second Third Fourth Fruit Total TF
0 1 30 80 14.0 14 124.0 False
1 2 22 28 NaN 85 50.0 True
2 3 18 16 NaN 17 34.0 True
3 4 49 56 30.0 22 135.0 False
4 5 22 30 27.0 14 79.0 False
#### Output ####
Id First Second Third Fourth Fruit Total TF
0 1 30 80 14.0 14 124.0 False
3 4 49 56 30.0 22 135.0 False
4 5 22 30 27.0 14 79.0 False
Id First Second Third Fourth Fruit Total TF
1 2 22 28 NaN 85 50.0 True
2 3 18 16 NaN 17 34.0 True
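A slightly leaner variant of the same idea (a sketch, equivalent in behavior) keeps the boolean mask in a variable instead of adding a helper column:
mask = df.isnull().any(axis=1)
df_without_none = df[~mask]
df_with_none = df[mask]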

pandas how to assign group id to groups whose sizes are > 1

I want to do a groupby on a df and then assign an id to each group whose size is > 1;
df_gr = df.groupby(['a', 'b', 'c'])
df_filtered = df_gr.filter(lambda x: len(x) > 1)
if df_filtered.shape[0] == 0:
    df_filtered['id'] = -1
else:
    ...  # put ids in df_filtered
I am wondering how to do that.
a b c d
10 2017 20.0 231
10 2017 20.0 223
20 2018 10.0 113
30 2017 11.0 134
30 2017 11.0 112
30 2017 11.0 111
the result df,
a b c d id
10 2017 20.0 231 1
10 2017 20.0 223 1
30 2017 11.0 134 2
30 2017 11.0 112 2
30 2017 11.0 111 2
if df_filtered.shape[0] != 0:
    df_filtered["id"] = df_filtered.groupby(
        ['a', 'b', 'c']).grouper.group_info[0]
I think you need transform with numpy.where:
df['id'] = np.where(df.groupby(['a', 'b', 'c'])['a'].transform('size') > 1, -1, 2)
print (df)
a b c d id
0 10 2017 20.0 231 -1
1 10 2017 20.0 223 -1
2 20 2018 10.0 113 2
3 30 2017 11.0 134 -1
4 30 2017 11.0 112 -1
5 30 2017 11.0 111 -1
If you want 1 and 0 values, another solution is to cast the boolean mask to integers:
df['id'] = np.where(df.groupby(['a', 'b', 'c'])['a'].transform('size') > 1, 1, 0)
df['id'] = (df.groupby(['a', 'b', 'c'])['a'].transform('size') > 1).astype(int)
print (df)
a b c d id
0 10 2017 20.0 231 1
1 10 2017 20.0 223 1
2 20 2018 10.0 113 0
3 30 2017 11.0 134 1
4 30 2017 11.0 112 1
5 30 2017 11.0 111 1
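For reference, transform('size') broadcasts each group's size back onto every row, which is what makes the per-row comparison possible (an illustration with the sample data):
print(df.groupby(['a', 'b', 'c'])['a'].transform('size'))
# 0    2
# 1    2
# 2    1
# 3    3
# 4    3
# 5    3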
EDIT: I think you need GroupBy.ngroup:
# create values by group size
df['id'] = df.groupby(['a', 'b', 'c'])['a'].transform('size')
# filter out rows
df = df[df['id'] > 1]
# sequential id values
df['id'] = df.groupby(['a', 'b', 'c'])['a'].ngroup() + 1
a b c d id
0 10 2017 20.0 231 1
1 10 2017 20.0 223 1
3 30 2017 11.0 134 2
4 30 2017 11.0 112 2
5 30 2017 11.0 111 2
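The same idea in one pass, with an explicit copy to avoid pandas' chained-assignment warning (a sketch, not part of the original answer):
sizes = df.groupby(['a', 'b', 'c'])['a'].transform('size')
out = df[sizes > 1].copy()
out['id'] = out.groupby(['a', 'b', 'c']).ngroup() + 1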

updating dataframe with iterrows

I want to compute values in a dataframe row by row with iterrows, as below:
df = pd.DataFrame([list(range(0, 6)) + [np.NaN] * 5,
                   list(range(10, 16)) + [np.NaN] * 5,
                   list(range(20, 26)) + [np.NaN] * 5,
                   list(range(30, 36)) + [np.NaN] * 5])
for (index, row) in df.iterrows():
    df.loc[index, 6:11] = row[1:6] - row[0]
Why is df not updated?
I even tried to replace row[1:6] - row[0] with df.loc[index, 1:6] - df.loc[index, 0] and it doesn't work. Is it a trivial mistake or a more subtle concept I don't master? And also, is there something more performant?
Pandas assignment with loc does index alignment before assignment, and your column names will be misaligned here. Do this:
for (index, row) in df.iterrows():
    df.loc[index, 6:11] = (row[1:6] - row[0]).values

df
Out[23]:
0 1 2 3 4 5 6 7 8 9 10
0 0 1 2 3 4 5 1.0 2.0 3.0 4.0 5.0
1 10 11 12 13 14 15 1.0 2.0 3.0 4.0 5.0
2 20 21 22 23 24 25 1.0 2.0 3.0 4.0 5.0
3 30 31 32 33 34 35 1.0 2.0 3.0 4.0 5.0
Documentation here for more information:
Warning pandas aligns all AXES when setting Series and DataFrame from
.loc, .iloc and .ix. This will not modify df because the column
alignment is before value assignment.
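To make the alignment issue concrete, here is a tiny self-contained illustration (labels 1 and 2 on the right-hand side never match target columns 6 and 7, so nothing is written until the labels are stripped):
import numpy as np
import pandas as pd

d = pd.DataFrame(np.nan, index=[0], columns=[6, 7])
s = pd.Series([1, 2], index=[1, 2])
d.loc[0, 6:7] = s          # aligns on labels 1, 2 vs columns 6, 7 -> stays NaN
d.loc[0, 6:7] = s.values   # positional assignment -> writes 1.0 and 2.0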
You rarely ever need to iterate through a dataframe. I would just do this:
import pandas
import numpy
x = numpy.array([
    list(range(0, 6)) + [numpy.NaN] * 5,
    list(range(10, 16)) + [numpy.NaN] * 5,
    list(range(20, 26)) + [numpy.NaN] * 5,
    list(range(30, 36)) + [numpy.NaN] * 5
])
x[:, 6:] = x[:, 1:6] - x[:, [0]]
pandas.DataFrame(x)
Gives me:
0 1 2 3 4 5 6 7 8 9 10
0 0.0 1.0 2.0 3.0 4.0 5.0 1.0 2.0 3.0 4.0 5.0
1 10.0 11.0 12.0 13.0 14.0 15.0 1.0 2.0 3.0 4.0 5.0
2 20.0 21.0 22.0 23.0 24.0 25.0 1.0 2.0 3.0 4.0 5.0
3 30.0 31.0 32.0 33.0 34.0 35.0 1.0 2.0 3.0 4.0 5.0
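Staying in pandas, a vectorized version of the same subtraction is also possible (a sketch; .to_numpy() sidesteps the column alignment exactly like .values above):
df.loc[:, 6:] = df.loc[:, 1:5].sub(df[0], axis=0).to_numpy()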
Thanks. I combined the two solutions:
df = pd.DataFrame([list(range(0, 6)) + [np.NaN] * 5,
                   list(range(10, 16)) + [np.NaN] * 5,
                   list(range(20, 26)) + [np.NaN] * 5,
                   list(range(30, 36)) + [np.NaN] * 5])
df.loc[:, 6:11] = (row[1:6] - row[0]).values
df
Out[10]:
0 1 2 3 4 5 6 7 8 9 10
0 0 1 2 3 4 5 1.0 2.0 3.0 4.0 5.0
1 10 11 12 13 14 15 1.0 2.0 3.0 4.0 5.0
2 20 21 22 23 24 25 1.0 2.0 3.0 4.0 5.0
3 30 31 32 33 34 35 1.0 2.0 3.0 4.0 5.0
EDIT:
As a matter of fact, this is not working! In my real example there is a problem and the data is not what it should be, unlike in this small example.
The iterrows() solution is slow (my data frame is around 9000*500), so I'm going with the numpy array solution: convert the data frame to a numpy array, do the calculation, and go back to a data frame.
import numpy as np
import pandas as pd
df = pd.DataFrame([list(range(0, 6)) + [np.NaN] * 5,
                   list(range(10, 16)) + [np.NaN] * 5,
                   list(range(20, 26)) + [np.NaN] * 5,
                   list(range(30, 36)) + [np.NaN] * 5])
x = df.as_matrix()  # note: as_matrix() is deprecated; newer pandas uses df.to_numpy()
x[:, 6:] = x[:, 1:6] - x[:, [0]]
df = pd.DataFrame(x, columns=df.columns, index=df.index, dtype='int8')
df
Out[15]:
0 1 2 3 4 5 6 7 8 9 10
0 0 1 2 3 4 5 1 2 3 4 5
1 10 11 12 13 14 15 1 2 3 4 5
2 20 21 22 23 24 25 1 2 3 4 5
3 30 31 32 33 34 35 1 2 3 4 5
