Agg and groupby by specific condition - python-3.x

I have this dataframe:
import numpy as np
import pandas as pd

index = [1, 2, 3, 4, 5, 6, 7, 8]
a = [1247, 1247, 1539, 1247, 1539, 1539, 1539, 1247]
b = ['Group_A', 'Group_A', 'Group_B', 'Group_C', 'Group_B', 'Group_B', 'Group_C', 'Group_B']
c = [np.nan, 23, 30, 27, np.nan, 42, 40, 62]
df = pd.DataFrame({'ID': a, 'Group': b, 'Unit_sold': c}, index=index)
Now I want to sum the units sold for Groups A and B combined, and separately for Group C, grouped by ID. The result should look like this:
ID Sum_AB Sum_C
0 1247 85.0 27.0
1 1539 72.0 40.0

Use Series.replace to rewrite the Group column, then assign with groupby() and unstack():
(df.assign(Group=df['Group'].replace(['A','B'],['AB','AB'],regex=True))
.groupby(['ID','Group'],sort=False)['Unit_sold'].sum().unstack()
.add_suffix('_sum').reset_index().rename_axis(None,axis=1))
ID Group_AB_sum Group_C_sum
0 1247 85.0 27.0
1 1539 72.0 40.0
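
If the regex replacement feels fragile, the same result can be obtained with a plain dict and Series.map (a sketch assuming only these three group labels occur):
# map every label explicitly instead of regex-substituting 'A'/'B'
mapping = {'Group_A': 'Group_AB', 'Group_B': 'Group_AB', 'Group_C': 'Group_C'}
(df.assign(Group=df['Group'].map(mapping))
 .groupby(['ID', 'Group'], sort=False)['Unit_sold'].sum().unstack()
 .add_suffix('_sum').reset_index().rename_axis(None, axis=1))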

Using np.where and pd.crosstab:
df['Group'] = np.where(df['Group'].isin(['Group_A', 'Group_B']), 'Sum_AB', 'Sum_C')
df2 = pd.crosstab(df.ID, df.Group, df.Unit_sold, aggfunc='sum').reset_index()
print(df2)
Group ID Sum_AB Sum_C
0 1247 85.0 27.0
1 1539 72.0 40.0
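The stray Group label above the header row is just the columns-axis name left behind by crosstab; if it bothers you, it can be dropped:
df2 = df2.rename_axis(None, axis=1)  # remove the 'Group' columns name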

Related

Pandas resample fill NaN

I have this df:
Timestamp List Power Energy Status
0 2020-01-01 01:05:50 [5, 5, 5] 7000 15000 online
1 2020-01-01 01:06:20 [6, 6, 6] 7500 16000 online
2 2020-01-01 01:08:30 [0, 0, 0] 5 0 offline
...
Now I want to resample it, using .resample something like this:
df2 = df.set_index('Timestamp').resample('min').?
I want the df in 1-minute intervals. For each interval, the rows should be aggregated as follows:
List: if Status is online, the last entry of the interval, else [0, 0, 0];
Power: if Status is online, the mean value of the interval, else 0;
Energy: if Status is online, the last entry of the interval, else 0;
Status: the last status of the interval.
How do I fill the NaN values that .resample outputs when there is no data in df for an interval? In that case the row should be filled with Power = 0, Energy = 0, Status = offline, and so on.
I tried something like this:
df2 = df.set_index('Timestamp').resample('T').agg({'List': 'last',
                                                   'Power': 'mean',
                                                   'Energy': 'last',
                                                   'Status': 'last'})
and got:
Timestamp List Power Energy Status
0 2020-01-01 01:05 [5, 5, 5] (average of the interval) 15000 online
1 2020-01-01 01:06 [6, 6, 6] (average of the interval) 16000 online
2 2020-01-01 01:07 NaN NaN NaN NaN
3 2020-01-01 01:08 [0, 0, 0] 5 0 offline
Expected outcome:
Timestamp List Power Energy Status
0 2020-01-01 01:05 [5, 5, 5] (average of the interval) 15000 online
1 2020-01-01 01:06 [6, 6, 6] (average of the interval) 16000 online
2 2020-01-01 01:07 [0, 0, 0] 0 0 offline
3 2020-01-01 01:08 [0, 0, 0] 5 0 offline
There is no way to pass a fillna rule that handles each column's NA values separately during .resample().agg(), as the docs show: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html
In your case even interpolation does not work, so handle each column's NA values manually.
First, let's initialize your sample frame.
import pandas as pd

data = {"Timestamp": {"0": "2020-01-01 01:05:50",
                      "1": "2020-01-01 01:06:20",
                      "2": "2020-01-01 01:08:30"},
        "List": {"0": [5, 5, 5],
                 "1": [6, 6, 6],
                 "2": [0, 0, 0]},
        "Power": {"0": 7000,
                  "1": 7500,
                  "2": 5},
        "Energy": {"0": 15000,
                   "1": 16000,
                   "2": 0},
        "Status": {"0": "online",
                   "1": "online",
                   "2": "offline"},
        }
df = pd.DataFrame(data)
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df = df.set_index('Timestamp').resample('T').agg({'List': 'last',
                                                  'Power': 'mean',
                                                  'Energy': 'last',
                                                  'Status': 'last'})
Now we can manually replace NA in each column separately
df["List"] = df["List"].fillna("[0, 0, 0]")
df["Status"] = df["Status"].fillna('offline')
df = df.fillna(0)
or, more conveniently, the dict way:
values = {
    'List': '[0, 0, 0]',
    'Status': 'offline',
    'Power': 0,
    'Energy': 0
}
df = df.fillna(value=values)
Timestamp List Power Energy Status
0 2020-01-01 01:05:00 [5, 5, 5] 7000.0 15000.0 online
1 2020-01-01 01:06:00 [6, 6, 6] 7500.0 16000.0 online
2 2020-01-01 01:07:00 [0, 0, 0] 0.0 0.0 offline
3 2020-01-01 01:08:00 [0, 0, 0] 5.0 0.0 offline
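One caveat: fillna rejects list fill values (it raises a TypeError), which is why List is filled with the string '[0, 0, 0]' rather than an actual list. If real list objects are needed, a workaround is to fill that column by hand instead:
# replace remaining NaNs in 'List' with a real list object
df['List'] = [x if isinstance(x, list) else [0, 0, 0] for x in df['List']]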

Can't assign value to cell in MultiIndex dataframe (assigning to copy / slice of df?)

I am trying to assign a value (the mean of values in another column) to a cell in a multi-index Pandas dataframe, iterating over the frame to calculate means over a moving window of a different column. But when I try to assign the value, it doesn't change.
I am not used to working with multi-indexes and have solved several other problems, but this one has me stumped for now...
Toy code that reproduces the problem:
import numpy as np
import pandas as pd

tuples = [
    ('AFG', 1963), ('AFG', 1964), ('AFG', 1965), ('AFG', 1966), ('AFG', 1967), ('AFG', 1968),
    ('BRA', 1963), ('BRA', 1964), ('BRA', 1965), ('BRA', 1966), ('BRA', 1967), ('BRA', 1968)
]
index = pd.MultiIndex.from_tuples(tuples)
values = [[12, None], [0, None],
          [12, None], [0, 4],
          [12, 5], [0, 4],
          [12, 2], [0, 4],
          [12, 2], [0, 4],
          [1, 4], [7, 1]]
df = pd.DataFrame(values, columns=['Oil', 'Pop'], index=index)
lag = -2
lead = 0
indicator = 'Pop'
new_indicator = 'Mean_pop'
df[new_indicator] = np.nan
df
Gives:
Oil Pop Mean_pop
AFG 1963 12 NaN NaN
1964 0 NaN NaN
1965 12 NaN NaN
1966 0 4.0 NaN
1967 12 5.0 NaN
1968 0 4.0 NaN
BRA 1963 12 2.0 NaN
1964 0 4.0 NaN
1965 12 2.0 NaN
1966 0 4.0 NaN
1967 1 4.0 NaN
1968 7 1.0 NaN
Then to iterate over the df:
for country, country_df in df.groupby(level=0):
    oldestyear = country_df[indicator].first_valid_index()[1]
    latestyear = country_df[indicator].last_valid_index()[1]
    for t in range(oldestyear, latestyear + 1):
        print(country, oldestyear, latestyear, t)
        print(" For", country, ", calculate mean over", t + lag, "to", t + lead,
              "and add to row for year", t)
        dftt = country_df.loc[(country, t + lag):(country, t + lead)]
        print(dftt[indicator])
        mean = dftt[indicator].mean(axis=0)
        print("mean for", indicator, "in", country, "during", t + lag, "to", t + lead, "is", mean)
        df.loc[country, t][new_indicator] = mean
Diagnostic output not pasted, but df looks the same after iterating over it, and I get the following warning on some iterations:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
if sys.path[0] == '':
Any pointers will be greatly appreciated.
I think it is as easy as setting the last line to:
df.loc[(country, t), new_indicator] = mean
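
For what it's worth, the whole loop can probably be replaced by a grouped rolling mean. A sketch, assuming the lag = -2 / lead = 0 window is equivalent to a 3-period trailing window:
# trailing mean of the current year and the two before it, per country;
# min_periods=1 lets the mean ignore NaNs, like Series.mean() does in the loop
df[new_indicator] = (df.groupby(level=0)[indicator]
                       .rolling(3, min_periods=1).mean()
                       .droplevel(0))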

Group by and aggregate pandas

I have this data frame, with the following data for each column:
import numpy as np
import pandas as pd

index = [1, 2, 3, 4, 5, 6, 7]
a = [1247, 1247, 1247, 1247, 1539, 1539, 1539]
b = ['Group_A', 'Group_A', 'Group_B', 'Group_B', 'Group_B', 'Group_B', 'Group_A']
c = [np.nan, 23, 30, 27, 18, 42, 40]
d = [50, 51, 67, np.nan, 44, 37, 49]
df = pd.DataFrame({'ID': a, 'Group': b, 'Unit_sold_1': c, 'Unit_sold_2': d}, index=index)
If I want to sum the Unit_sold columns for each ID, I could use this code:
df.groupby(df['ID']).agg({'Unit_sold_1':'sum', 'Unit_sold_2':'sum'})
But what should I write if I want to group them by ID and then by Group? The result should look like this:
ID Group_A_sold_1 Group_B_sold_1 Group_A_sold_2 Group_B_sold_2
0 1247 23 57 101 67
1 1539 40 60 49 81
Do it with pivot_table, then join the column levels:
s = df.pivot_table(index='ID', columns='Group', values=['Unit_sold_1', 'Unit_sold_2'], aggfunc='sum')
s.columns = s.columns.map('_'.join)
s.reset_index(inplace=True)
Unit_sold_1_Group_A ... Unit_sold_2_Group_B
ID ...
1247 23.0 ... 67.0
1539 40.0 ... 81.0
[2 rows x 4 columns]
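The same table can be built with groupby plus unstack, if you prefer; an equivalent sketch:
s = df.groupby(['ID', 'Group'])[['Unit_sold_1', 'Unit_sold_2']].sum().unstack()
s.columns = s.columns.map('_'.join)  # flatten ('Unit_sold_1', 'Group_A') into 'Unit_sold_1_Group_A'
s = s.reset_index()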

Pandas: How to build a column based on another column which is indexed by another one?

I have the dataframe presented below and tried the solution that follows, but I am not sure it is a good one.
import numpy as np
import pandas as pd

def creatingDataFrame():
    raw_data = {'code': [1, 2, 3, 2, 3, 3],
                'Region': ['A', 'A', 'C', 'B', 'A', 'B'],
                'var-A': [2, 4, 6, 4, 6, 6],
                'var-B': [20, 30, 40, 50, 10, 20],
                'var-C': [3, 4, 5, 1, 2, 3]}
    df = pd.DataFrame(raw_data, columns=['code', 'Region', 'var-A', 'var-B', 'var-C'])
    return df

if __name__ == "__main__":
    df = creatingDataFrame()
    df['var'] = (np.where(df['Region'] == 'A', 1.0, 0.0) * df['var-A']
                 + np.where(df['Region'] == 'B', 1.0, 0.0) * df['var-B']
                 + np.where(df['Region'] == 'C', 1.0, 0.0) * df['var-C'])
I want the variable var to take the value of column 'var-A', 'var-B', or 'var-C' depending on the region given in the 'Region' column.
The result must be
df['var']
Out[50]:
0 2.0
1 4.0
2 5.0
3 50.0
4 6.0
5 20.0
Name: var, dtype: float64
You can try with lookup, after stripping the 'var-' prefix from the column names:
df.columns=df.columns.str.split('-').str[-1]
df
Out[255]:
code Region A B C
0 1 A 2 20 3
1 2 A 4 30 4
2 3 C 6 40 5
3 2 B 4 50 1
4 3 A 6 10 2
5 3 B 6 20 3
df.lookup(df.index,df.Region)
Out[256]: array([ 2, 4, 5, 50, 6, 20], dtype=int64)
#df['var']=df.lookup(df.index,df.Region)
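Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. The replacement suggested in the pandas docs uses factorize and reindex; a sketch, assuming the columns have already been renamed to plain A/B/C as above:
import numpy as np
# pick, for each row, the cell in the column named by that row's Region
idx, cols = pd.factorize(df['Region'])
df['var'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]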

Updating values in a pandas dataframe using another dataframe

I have an existing pandas Dataframe with the following format:
sample_dict = {'ID': [100, 200, 300], 'a': [1, 2, 3], 'b': [.1, .2, .3], 'c': [4, 5, 6], 'd': [.4, .5, .6]}
df_sample = pd.DataFrame(sample_dict)
Now, I want to update df_sample using another dataframe that looks like this:
sample_update = {'ID': [100, 300], 'a': [3, 2], 'b': [.4, .2], 'c': [2, 5], 'd': [.7, .1]}
df_updater = pd.DataFrame(sample_update)
The rule for the update is this:
For columns a and c, just add the values from a and c in df_updater.
For column b, it depends on the updated value of a. Let's say the update function would be b = old_b + (new_b / updated_a).
For column d, the rules are similar to that of column b except that it depends on values of the updated c and new_d.
Here is the desired output:
new = {'ID': [100, 200, 300], 'a': [4, 2, 5], 'b': [.233333, .2, .33999999], 'c': [6, 5, 11], 'd': [.51666666, .5, .609090]}
df_new = pd.DataFrame(new)
My actual problem is a slightly more complicated version of this, but I think this example is enough to explain it. Also, in my real DataFrame I have more columns following the same rules, so I would like this method to loop over the columns if possible. Thanks!
You can use merge, add and div:
df = pd.merge(df_sample, df_updater, on='ID', how='left')
df[['a', 'c']] = df[['a_y', 'c_y']].add(df[['a_x', 'c_x']].values, fill_value=0)
df['b'] = df['b_x'].add(df['b_y'].div(df.a_y), fill_value=0)
df['d'] = df['c_x'].add(df['d_y'].div(df.c_y), fill_value=0)
print (df)
ID a_x b_x c_x d_x a_y b_y c_y d_y a c b d
0 100 1 0.1 4 0.4 3.0 0.4 2.0 0.7 4.0 6.0 0.233333 4.35
1 200 2 0.2 5 0.5 NaN NaN NaN NaN 2.0 5.0 0.200000 5.00
2 300 3 0.3 6 0.6 2.0 0.2 5.0 0.1 5.0 11.0 0.400000 6.02
print (df[['a','b','c','d']])
a b c d
0 4.0 0.233333 6.0 4.35
1 2.0 0.200000 5.0 5.00
2 5.0 0.400000 11.0 6.02
Instead of merge it is possible to use concat:
df = pd.concat([df_sample.set_index('ID'), df_updater.set_index('ID')], axis=1, keys=('_x', '_y'))
df.columns = [''.join((col[1], col[0])) for col in df.columns]
df.reset_index(inplace=True)
print (df)
ID a_x b_x c_x d_x a_y b_y c_y d_y
0 100 1 0.1 4 0.4 3.0 0.4 2.0 0.7
1 200 2 0.2 5 0.5 NaN NaN NaN NaN
2 300 3 0.3 6 0.6 2.0 0.2 5.0 0.1
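
Since the question asks for something that loops over more columns with the same rules, here is a hedged sketch generalizing the merged frame above. The column lists are hypothetical and follow the b pattern of the answer (divide the new value by the new base column); note the answer's d line starts from c_x, which looks like a typo for d_x, so the sketch uses d_x:
add_cols = ['a', 'c']              # columns that are simply summed
ratio_cols = {'b': 'a', 'd': 'c'}  # dependent column -> the column it divides by
for col in add_cols:
    df[col] = df[f'{col}_x'].add(df[f'{col}_y'], fill_value=0)
for col, base in ratio_cols.items():
    df[col] = df[f'{col}_x'].add(df[f'{col}_y'].div(df[f'{base}_y']), fill_value=0)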
