I have this dataframe:
index = [1, 2, 3, 4, 5, 6, 7, 8]
a = [1247, 1247, 1539, 1247, 1539, 1539, 1539, 1247]
b = ['Group_A', 'Group_A', 'Group_B', 'Group_C', 'Group_B', 'Group_B', 'Group_C', 'Group_B']
c = [np.nan, 23, 30, 27, np.nan, 42, 40, 62]
df = pd.DataFrame({'ID': a, 'Group': b, 'Unit_sold': c})
Now I want to calculate, for each ID, the number of units sold for groups A and B combined and for group C on its own. The result should look like this:
ID Sum_AB Sum_C
0 1247 85.0 27.0
1 1539 72.0 40.0

Use Series.replace to map Group_A and Group_B to a single label in the Group column, assign it back, then use groupby() and unstack(). The result:
ID Group_AB_sum Group_C_sum
0 1247 85.0 27.0
1 1539 72.0 40.0
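The code for this approach is not shown above. A minimal sketch of what it might look like follows, assuming the merged label is called Group_AB and the _sum suffix is added with add_suffix (both are guesses based on the column names in the output, not something the answer states):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': [1247, 1247, 1539, 1247, 1539, 1539, 1539, 1247],
    'Group': ['Group_A', 'Group_A', 'Group_B', 'Group_C',
              'Group_B', 'Group_B', 'Group_C', 'Group_B'],
    'Unit_sold': [np.nan, 23, 30, 27, np.nan, 42, 40, 62],
})

# Assumed label: fold Group_A and Group_B into a single 'Group_AB' category
merged = df['Group'].replace({'Group_A': 'Group_AB', 'Group_B': 'Group_AB'})

out = (df.assign(Group=merged)
         .groupby(['ID', 'Group'])['Unit_sold'].sum()  # sum skips NaN by default
         .unstack()                                    # one column per merged group
         .add_suffix('_sum')
         .reset_index()
         .rename_axis(columns=None))                   # drop the leftover 'Group' axis name
print(out)
```

The original df is left untouched because assign returns a new frame, which matters if the Group column is needed again later.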

Alternatively, using np.where and pd.crosstab:
df['Group'] = np.where(df['Group'].isin(['Group_A','Group_B']),'Sum_AB','Sum_C')
df2 = pd.crosstab(df.ID,df.Group,df.Unit_sold,aggfunc='sum').reset_index()
Group ID Sum_AB Sum_C
0 1247 85.0 27.0
1 1539 72.0 40.0


Pandas resample fill NaN

I have this df:
Timestamp List Power Energy Status
0 2020-01-01 01:05:50 [5, 5, 5] 7000 15000 online
1 2020-01-01 01:06:20 [6, 6, 6] 7500 16000 online
2 2020-01-01 01:08:30 [0, 0, 0] 5 0 offline
Now I want to resample it into 1-minute intervals with .resample, along these lines:
df2 = df.set_index('Timestamp').resample('min').?
Each interval should be filled from its rows as follows:
List: if status = online, the last entry of the interval, else '0';
Power: if status = online, the mean value of the interval, else '0';
Energy: if status = online, the last entry of the interval, else '0';
Status: the last status of the interval.
How do I fill the NaN values that .resample outputs when there is no data in df for an interval? In that case the row should be filled as follows: Power = 0; Energy = 0; Status = offline; ...
I tried something like that:
df2 = df.set_index('Timestamp').resample('T').agg({'List': 'last', 'Power': 'mean', 'Energy': 'last', 'Status': 'last'})
and got:
Timestamp List Power Energy Status
0 2020-01-01 01:05 [5, 5, 5] (average of the interval) 15000 online
1 2020-01-01 01:06 [6, 6, 6] (average of the interval) 16000 online
2 2020-01-01 01:07 NaN NaN NaN NaN
3 2020-01-01 01:08 [0, 0, 0] 5 0 offline
Expected outcome:
Timestamp List Power Energy Status
0 2020-01-01 01:05 [5, 5, 5] (average of the interval) 15000 online
1 2020-01-01 01:06 [6, 6, 6] (average of the interval) 16000 online
2 2020-01-01 01:07 [0, 0, 0] 0 0 offline
3 2020-01-01 01:08 [0, 0, 0] 5 0 offline
As far as the docs show, there is no way to pass a per-column fill rule to .resample().agg() itself.
Interpolation does not help in your case either, so handle the NA values of each column manually after resampling.
Firstly, let's initialize your sample frame.
import pandas as pd
data = {"Timestamp": {"0": "2020-01-01 01:05:50",
                      "1": "2020-01-01 01:06:20",
                      "2": "2020-01-01 01:08:30"},
        "List": {"0": [5, 5, 5],
                 "1": [6, 6, 6],
                 "2": [0, 0, 0]},
        "Power": {"0": 7000,
                  "1": 7500,
                  "2": 5},
        "Energy": {"0": 15000,
                   "1": 16000,
                   "2": 0},
        "Status": {"0": "online",
                   "1": "online",
                   "2": "offline"}}
df = pd.DataFrame(data)
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
# Aggregation rules follow the question: last List, mean Power, last Energy, last Status.
# 'min' is the same 1-minute frequency as the older 'T' alias.
df = df.set_index('Timestamp').resample('min').agg({'List': 'last',
                                                    'Power': 'mean',
                                                    'Energy': 'last',
                                                    'Status': 'last'})
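Now we can replace the NA values in each column separately. fillna does not accept a list as the fill value, so the List column is filled with the string '[0, 0, 0]' here rather than with an actual list:
df["List"] = df["List"].fillna("[0, 0, 0]")
df["Status"] = df["Status"].fillna('offline')
df = df.fillna(0)
Or, more conveniently, pass a dict that maps each column to its fill value:
values = {
    'List': '[0, 0, 0]',
    'Status': 'offline',
    'Power': 0,
    'Energy': 0
}
df = df.fillna(value=values)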
Timestamp List Power Energy Status
0 2020-01-01 01:05:00 [5, 5, 5] 7000.0 15000.0 online
1 2020-01-01 01:06:00 [6, 6, 6] 7500.0 16000.0 online
2 2020-01-01 01:07:00 [0, 0, 0] 0.0 0.0 offline
3 2020-01-01 01:08:00 [0, 0, 0] 5.0 0.0 offline
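If the empty intervals should hold a real list rather than the string '[0, 0, 0]', one possible variation is sketched below. It assumes the same sample data and aggregation rules as above; the apply step with an isinstance check is my own addition, not part of the answer, and it simply swaps every non-list entry in the List column for a fresh [0, 0, 0].

```python
import pandas as pd

df = pd.DataFrame({
    "Timestamp": pd.to_datetime(["2020-01-01 01:05:50",
                                 "2020-01-01 01:06:20",
                                 "2020-01-01 01:08:30"]),
    "List": [[5, 5, 5], [6, 6, 6], [0, 0, 0]],
    "Power": [7000, 7500, 5],
    "Energy": [15000, 16000, 0],
    "Status": ["online", "online", "offline"],
})

# Resample to 1-minute bins; the 01:07 bin has no rows and comes back empty
out = df.set_index("Timestamp").resample("min").agg(
    {"List": "last", "Power": "mean", "Energy": "last", "Status": "last"}
)

# Scalar columns: a dict gives each column its own fill value
out = out.fillna({"Power": 0, "Energy": 0, "Status": "offline"})

# List column: fillna cannot take a list, so replace missing entries one by one
out["List"] = out["List"].apply(lambda v: v if isinstance(v, list) else [0, 0, 0])

out = out.reset_index()
print(out)
```

Note that this only fills intervals with no data; the "if status = online ... else 0" rules from the question would still need their own handling.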

Can't assign value to cell in multiindex dataframe (assigning to copy / slice of df?)

I am trying to assign a value (mean of values in another column) to a cell in a multi-index Pandas dataframe over which I iterate to calculate means over a moving window in a different column. But, when I try to assign the value it doesn't change.
I am not used to working with multi-indexes and have solved several other problems but this one has me stumped for now...
Toy code that reproduces the problem:
import numpy as np
import pandas as pd
tuples = [
    ('AFG', 1963), ('AFG', 1964), ('AFG', 1965), ('AFG', 1966), ('AFG', 1967), ('AFG', 1968),
    ('BRA', 1963), ('BRA', 1964), ('BRA', 1965), ('BRA', 1966), ('BRA', 1967), ('BRA', 1968)
]
index = pd.MultiIndex.from_tuples(tuples)
values = [[12, None], [0, None],
[12, None], [0, 4],
[12, 5], [0, 4],
[12, 2], [0, 4],
[12, 2], [0, 4],
[1, 4], [7, 1]]
df = pd.DataFrame(values, columns=['Oil', 'Pop'], index=index)
lag = -2
lead = 0  # assumption: lead is used in the loop below but its value is not given in the post
indicator = 'Pop'
new_indicator = 'Mean_pop'
df[new_indicator] = np.nan
Oil Pop Mean_pop
AFG 1963 12 NaN NaN
1964 0 NaN NaN
1965 12 NaN NaN
1966 0 4.0 NaN
1967 12 5.0 NaN
1968 0 4.0 NaN
BRA 1963 12 2.0 NaN
1964 0 4.0 NaN
1965 12 2.0 NaN
1966 0 4.0 NaN
1967 1 4.0 NaN
1968 7 1.0 NaN
Then to iterate over the df:
for country, country_df in df.groupby(level=0):
    oldestyear = country_df[indicator].first_valid_index()[1]
    latestyear = country_df[indicator].last_valid_index()[1]
    for t in range(oldestyear, latestyear+1):
        print(country, oldestyear, latestyear, t)
        print(" For", country, ", calculate mean over ", t+lag, "to", t+lead,
              "and add to row for year", t)
        dftt = country_df.loc[(country, t+lag):(country, t+lead)]
        mean = dftt[indicator].mean(axis=0)
        print("mean for ", indicator, "in", country, "during", t+lag, "to", t+lead, "is", mean)
        df.loc[country, t][new_indicator] = mean
Diagnostic output not pasted, but df looks the same after iterating over it and I get the following warning on some iterations:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation:
if sys.path[0] == '':
Any pointers will be greatly appreciated.
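I think it is as easy as changing the last line so that the row and the column are selected in a single .loc call. The chained form df.loc[country, t][new_indicator] = mean writes into a temporary copy of the row, which is why df never changes: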
df.loc[(country, t), new_indicator] = mean
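A small self-contained check of that single .loc assignment, using a cut-down frame of my own (four rows and a made-up mean of 4.5, purely for illustration):

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('AFG', 1963), ('AFG', 1964), ('BRA', 1963), ('BRA', 1964)]
)
df = pd.DataFrame({'Pop': [4.0, 5.0, 2.0, 4.0]}, index=index)
df['Mean_pop'] = np.nan

country, t, mean = 'AFG', 1964, 4.5  # illustrative values only

# One .loc call: the row key is the full (country, year) tuple, the column is the second argument
df.loc[(country, t), 'Mean_pop'] = mean
print(df)
```

Only the ('AFG', 1964) row is updated; the other rows keep their NaN.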

Group by and aggregate pandas

I have this data frame, built from the following data for each column:
index = [1, 2, 3, 4, 5, 6, 7]
a = [1247, 1247, 1247, 1247, 1539, 1539, 1539]
b = ['Group_A', 'Group_A', 'Group_B', 'Group_B', 'Group_B', 'Group_B', 'Group_A']
c = [np.nan, 23, 30, 27, 18, 42, 40]
d = [50, 51, 67, np.nan, 44, 37, 49]
df = pd.DataFrame({'ID': a, 'Group': b, 'Unit_sold_1': c, 'Unit_sold_2':d})
If I want to sum the Unit_sold columns for each ID, I can use this code:
df.groupby(df['ID']).agg({'Unit_sold_1':'sum', 'Unit_sold_2':'sum'})
But what code should I use if I want to group by ID and then by Group? The result should look like this:
ID Group_A_sold_1 Group_B_sold_1 Group_A_sold_2 Group_B_sold_2
0 1247 23 57 101 67
1 1539 40 60 49 81
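Do it with pivot_table, using ID as the index and Group as the columns, then merge the two resulting column levels into single names. The result: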
Unit_sold_1_Group_A ... Unit_sold_2_Group_B
ID ...
1247 23.0 ... 67.0
1539 40.0 ... 81.0
[2 rows x 4 columns]
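The code for this is not included above. A minimal sketch that produces column names in the same Unit_sold_1_Group_A style, assuming aggfunc='sum' and a plain '_'.join over the column levels:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': [1247, 1247, 1247, 1247, 1539, 1539, 1539],
    'Group': ['Group_A', 'Group_A', 'Group_B', 'Group_B',
              'Group_B', 'Group_B', 'Group_A'],
    'Unit_sold_1': [np.nan, 23, 30, 27, 18, 42, 40],
    'Unit_sold_2': [50, 51, 67, np.nan, 44, 37, 49],
})

# Rows: ID, column levels: (value column, Group), cells: sums
out = df.pivot_table(index='ID', columns='Group',
                     values=['Unit_sold_1', 'Unit_sold_2'], aggfunc='sum')

# Flatten ('Unit_sold_1', 'Group_A') into 'Unit_sold_1_Group_A'
out.columns = out.columns.map('_'.join)
print(out)
```

Call out.reset_index() afterwards if ID should be a regular column, as in the layout requested in the question.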

Pandas: How to build a column based on another column which is indexed by another one?

I have this dataframe presented below. I tried a solution below, but I am not sure if this is a good solution.
import pandas as pd
def creatingDataFrame():
    raw_data = {'code': [1, 2, 3, 2, 3, 3],
                'Region': ['A', 'A', 'C', 'B', 'A', 'B'],
                'var-A': [2, 4, 6, 4, 6, 6],
                'var-B': [20, 30, 40, 50, 10, 20],
                'var-C': [3, 4, 5, 1, 2, 3]}
    df = pd.DataFrame(raw_data, columns=['code', 'Region', 'var-A', 'var-B', 'var-C'])
    return df

if __name__ == "__main__":
    df = creatingDataFrame()
I want a new variable var that takes its value from column 'var-A', 'var-B' or 'var-C', depending on the value in the 'Region' column of that row.
The result must be
0 2.0
1 4.0
2 5.0
3 50.0
4 6.0
5 20.0
Name: var, dtype: float64
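You can try DataFrame.lookup, after renaming the var-A, var-B and var-C columns to A, B and C so that they match the Region values. Note that lookup was deprecated in pandas 1.2 and removed in 2.0, so this only works on older versions.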
code Region A B C
0 1 A 2 20 3
1 2 A 4 30 4
2 3 C 6 40 5
3 2 B 4 50 1
4 3 A 6 10 2
5 3 B 6 20 3
Out[256]: array([ 2, 4, 5, 50, 6, 20], dtype=int64)
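On pandas versions without DataFrame.lookup, the same row-by-row pick can be done with NumPy integer indexing. This is a sketch of one such replacement, not the answer's own code; it assumes every Region value has a matching 'var-<Region>' column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'code': [1, 2, 3, 2, 3, 3],
    'Region': ['A', 'A', 'C', 'B', 'A', 'B'],
    'var-A': [2, 4, 6, 4, 6, 6],
    'var-B': [20, 30, 40, 50, 10, 20],
    'var-C': [3, 4, 5, 1, 2, 3],
})

value_cols = ['var-A', 'var-B', 'var-C']
values = df[value_cols].to_numpy()

# For each row, the position of the column named after its Region
wanted = ['var-' + region for region in df['Region']]
col_pos = df[value_cols].columns.get_indexer(wanted)

# Pair row i with its own column position
df['var'] = values[np.arange(len(df)), col_pos]
print(df['var'])
```

get_indexer returns -1 for a label it cannot find, so a Region without a matching column would silently pick the last column; check for -1 first if that can happen in the real data.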

Updating values in a pandas dataframe using another dataframe

I have an existing pandas Dataframe with the following format:
sample_dict = {'ID': [100, 200, 300], 'a': [1, 2, 3], 'b': [.1, .2, .3], 'c': [4, 5, 6], 'd': [.4, .5, .6]}
df_sample = pd.DataFrame(sample_dict)
Now, I want to update df_sample using another dataframe that looks like this:
sample_update = {'ID': [100, 300], 'a': [3, 2], 'b': [.4, .2], 'c': [2, 5], 'd': [.7, .1]}
df_updater = pd.DataFrame(sample_update)
The rule for the update is this:
For column a and c, just add values from a and c in df_updater.
For column b, it depends on the updated value of a. Let's say the update function would be b = old_b + (new_b / updated_a).
For column d, the rules are similar to that of column b except that it depends on values of the updated c and new_d.
Here is the desired output:
new = {'ID': [100, 200, 300], 'a': [4, 2, 5], 'b': [.233333, .2, .33999999], 'c': [6, 5, 11], 'd': [.51666666, .5, .609090]}
df_new = pd.DataFrame(new)
My actual problems are using a little more complicated version of this but I think this example is enough to solve my problem. Also, In my real DataFrame, I have more columns following the same rules so I would like to make this method to loop over the columns if possible. Thanks!
You can use the functions merge, add and div. Following the stated rules, b is divided by the updated a, and d by the updated c:
df = pd.merge(df_sample, df_updater, on='ID', how='left')
df[['a','c']] = df[['a_y','c_y']].add(df[['a_x','c_x']].values, fill_value=0)
df['b'] = df['b_x'].add(df['b_y'].div(df['a']), fill_value=0)
df['d'] = df['d_x'].add(df['d_y'].div(df['c']), fill_value=0)
print (df)
ID a_x b_x c_x d_x a_y b_y c_y d_y a c b d
0 100 1 0.1 4 0.4 3.0 0.4 2.0 0.7 4.0 6.0 0.20 0.516667
1 200 2 0.2 5 0.5 NaN NaN NaN NaN 2.0 5.0 0.20 0.500000
2 300 3 0.3 6 0.6 2.0 0.2 5.0 0.1 5.0 11.0 0.34 0.609091
print (df[['a','b','c','d']])
a b c d
0 4.0 0.20 6.0 0.516667
1 2.0 0.20 5.0 0.500000
2 5.0 0.34 11.0 0.609091
Under this rule, b for ID 100 is 0.1 + 0.4 / 4 = 0.2. The .233333 listed in the question corresponds to dividing by the new a (3) rather than the updated a (4).
Instead of merge, it is possible to use concat:
df = pd.concat([df_sample.set_index('ID'), df_updater.set_index('ID')], axis=1, keys=('_x','_y'))
df.columns = [''.join((col[1], col[0])) for col in df.columns]
df = df.reset_index()
print (df)
ID a_x b_x c_x d_x a_y b_y c_y d_y
0 100 1 0.1 4 0.4 3.0 0.4 2.0 0.7
1 200 2 0.2 5 0.5 NaN NaN NaN NaN
2 300 3 0.3 6 0.6 2.0 0.2 5.0 0.1
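Since the question asks for something that loops over more columns following the same rules, here is one way that could be sketched. The pairs list is a hypothetical helper of mine: each entry names a column that is simply added and the column that depends on it, and the dependent column is divided by the updated value, as the rule in the question states.

```python
import pandas as pd

df_sample = pd.DataFrame({'ID': [100, 200, 300], 'a': [1, 2, 3],
                          'b': [.1, .2, .3], 'c': [4, 5, 6], 'd': [.4, .5, .6]})
df_updater = pd.DataFrame({'ID': [100, 300], 'a': [3, 2],
                           'b': [.4, .2], 'c': [2, 5], 'd': [.7, .1]})

old = df_sample.set_index('ID')
# Align the updater on the sample's IDs; IDs without an update become NaN rows
new = df_updater.set_index('ID').reindex(old.index)

# Hypothetical configuration: (added column, column that depends on it)
pairs = [('a', 'b'), ('c', 'd')]

result = old.copy()
for added_col, dependent_col in pairs:
    # e.g. a = old_a + new_a, treating a missing update as 0
    result[added_col] = old[added_col].add(new[added_col], fill_value=0)
    # e.g. b = old_b + new_b / updated_a, leaving rows without an update unchanged
    result[dependent_col] = old[dependent_col].add(
        new[dependent_col] / result[added_col], fill_value=0)

result = result.reset_index()
print(result)
```

Extending this to more columns should only require adding entries to pairs, provided they follow the same two rules.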
