Pandas: GroupBy Shift And Cumulative Sum

I want to do a groupby, shift and cumsum, which seems like a pretty trivial task, but I'm still banging my head over the result I'm getting. Can someone please tell me what I am doing wrong? All the results I found online show the same thing, or some variation of what I am doing. Below is my implementation.
temp = pd.DataFrame(data=[['a',1],['a',1],['a',1],['b',1],['b',1],['b',1],['c',1],['c',1]], columns=['ID','X'])
temp['transformed'] = temp.groupby('ID')['X'].cumsum().shift()
print(temp)
ID X transformed
0 a 1 NaN
1 a 1 1.0
2 a 1 2.0
3 b 1 3.0
4 b 1 1.0
5 b 1 2.0
6 c 1 3.0
7 c 1 1.0
This is wrong; what I am actually looking for is below:
ID X transformed
0 a 1 NaN
1 a 1 1.0
2 a 1 2.0
3 b 1 NaN
4 b 1 1.0
5 b 1 2.0
6 c 1 NaN
7 c 1 1.0
Thanks a lot in advance.

You could use transform() to feed the separate groups that are created at each level of groupby into the cumsum() and shift() methods.
temp['transformed'] = \
temp.groupby('ID')['X'].transform(lambda x: x.cumsum().shift())
ID X transformed
0 a 1 NaN
1 a 1 1.0
2 a 1 2.0
3 b 1 NaN
4 b 1 1.0
5 b 1 2.0
6 c 1 NaN
7 c 1 1.0
For more info on transform() please see here:
https://jakevdp.github.io/PythonDataScienceHandbook/03.08-aggregation-and-grouping.html#Transformation
https://pandas.pydata.org/pandas-docs/version/0.22/groupby.html#transformation
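To see why the attempt in the question leaks values across groups, here is a minimal sketch using the question's data. The names leaky and per_group are only illustrative. groupby(...).cumsum() hands back an ordinary Series, so the .shift() chained after it no longer knows about the groups.

```python
import pandas as pd

temp = pd.DataFrame(
    data=[['a', 1], ['a', 1], ['a', 1], ['b', 1], ['b', 1], ['b', 1], ['c', 1], ['c', 1]],
    columns=['ID', 'X'],
)

# cumsum() runs per group but returns a plain Series aligned with temp's index,
# so the following shift() moves values across group boundaries.
leaky = temp.groupby('ID')['X'].cumsum().shift()

# Inside transform, shift() runs once per group, so every group starts with NaN.
per_group = temp.groupby('ID')['X'].transform(lambda x: x.cumsum().shift())

print(pd.DataFrame({'ID': temp['ID'], 'leaky': leaky, 'per_group': per_group}))
```

Row 3 (the first 'b' row) is where the two columns differ: leaky carries over the last total from group 'a', while per_group is NaN.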

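You need to use apply, because cumsum here is called on the groupby object (so it runs per group), while the shift that follows is applied to the whole resulting Series rather than per group. On pandas 2.0 and later, apply adds the group keys to the result index by default, so group_keys=False is passed to keep the original index for the assignment:
temp['transformed'] = temp.groupby('ID', group_keys=False)['X'].apply(lambda x: x.cumsum().shift())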
temp
Out[287]:
ID X transformed
0 a 1 NaN
1 a 1 1.0
2 a 1 2.0
3 b 1 NaN
4 b 1 1.0
5 b 1 2.0
6 c 1 NaN
7 c 1 1.0

While working on this problem, I noticed that using lambdas with transform gets very slow as the DataFrame grows. I found that using the built-in DataFrameGroupBy methods (like cumsum and shift) instead of lambdas is much faster.
So here's my proposed solution, creating a 'temp' column to save the cumsum for each ID and then shifting in a different groupby:
df['temp'] = df.groupby("ID")['X'].cumsum()
df['transformed'] = df.groupby("ID")['temp'].shift()
df = df.drop(columns=["temp"])
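The same two-step idea can probably be written without the temporary column by regrouping the intermediate Series on the ID column. This is a sketch on the question's data, with a check against the lambda version; the variable names running and expected are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'ID': list('aaabbbcc'), 'X': [1] * 8})

# Step 1: per-group running total using the built-in method (no Python lambda).
running = df.groupby('ID')['X'].cumsum()

# Step 2: regroup the running totals by the same key and shift within each group.
df['transformed'] = running.groupby(df['ID']).shift()

# Sanity check against the lambda-based transform.
expected = df.groupby('ID')['X'].transform(lambda x: x.cumsum().shift())
pd.testing.assert_series_equal(df['transformed'], expected, check_names=False)
print(df)
```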

Related

Join with column having the max sequence number

I have a margin table
item margin
0 a 3
1 b 4
2 c 5
and an item table
item sequence
0 a 1
1 a 2
2 a 3
3 b 1
4 b 2
5 c 1
6 c 2
7 c 3
I want to join the two tables so that the margin is only joined to the row with the maximum sequence number for each item. The desired outcome is
item sequence margin
0 a 1 NaN
1 a 2 NaN
2 a 3 3.0
3 b 1 NaN
4 b 2 4.0
5 c 1 NaN
6 c 2 NaN
7 c 3 5.0
How to achieve this?
Below is the code for margin and item table
import pandas as pd
df_margin=pd.DataFrame({"item":["a","b","c"],"margin":[3,4,5]})
df_item=pd.DataFrame({"item":["a","a","a","b","b","c","c","c"],"sequence":[1,2,3,1,2,1,2,3]})
One option would be to merge then replace extra values with NaN via Series.where:
new_df = df_item.merge(df_margin)
new_df['margin'] = new_df['margin'].where(
new_df.groupby('item')['sequence'].transform('max').eq(new_df['sequence'])
)
Or with loc (this needs import numpy as np):
new_df = df_item.merge(df_margin)
new_df.loc[new_df.groupby('item')['sequence']
           .transform('max').ne(new_df['sequence']), 'margin'] = np.nan
Another option is to assign a temporary column t to both frames before merging. In df_item, t is True only where sequence equals the group maximum; in df_margin, t is True everywhere. An outer merge on the shared columns (item and t) then attaches the margin only to the maximal rows, after which t is dropped:
new_df = (
    df_item.assign(
        t=df_item
        .groupby('item')['sequence']
        .transform('max')
        .eq(df_item['sequence'])
    ).merge(df_margin.assign(t=True), how='outer').drop(columns='t')
)
All of these options produce the same new_df:
item sequence margin
0 a 1 NaN
1 a 2 NaN
2 a 3 3.0
3 b 1 NaN
4 b 2 4.0
5 c 1 NaN
6 c 2 NaN
7 c 3 5.0
You could do:
df_item.merge(df_item.groupby('item')['sequence'].max().\
reset_index().merge(df_margin), 'left')
item sequence margin
0 a 1 NaN
1 a 2 NaN
2 a 3 3.0
3 b 1 NaN
4 b 2 4.0
5 c 1 NaN
6 c 2 NaN
7 c 3 5.0
Breakdown:
df_new = df_item.groupby('item')['sequence'].max().reset_index().merge(df_margin)
df_item.merge(df_new, 'left')
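A different route, not taken by the answers above, is to look up the index label of each item's maximal row with idxmax and map the margin onto just those rows. The sketch below assumes each item has a single maximal sequence value; with ties, idxmax keeps only the first occurrence, unlike the transform('max') comparison. The names lookup and last_rows are illustrative.

```python
import pandas as pd

df_margin = pd.DataFrame({"item": ["a", "b", "c"], "margin": [3, 4, 5]})
df_item = pd.DataFrame({"item": ["a", "a", "a", "b", "b", "c", "c", "c"],
                        "sequence": [1, 2, 3, 1, 2, 1, 2, 3]})

# Lookup Series: item -> margin.
lookup = df_margin.set_index("item")["margin"]

# Index labels of the row holding the maximum sequence within each item.
last_rows = df_item.groupby("item")["sequence"].idxmax().tolist()

new_df = df_item.copy()
# Assigning a Series that covers only some index labels leaves NaN everywhere else.
new_df["margin"] = new_df.loc[last_rows, "item"].map(lookup)
print(new_df)
```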

Working with two data frames with different size in python

I am working with two data frames.
The sample data is as follow:
DF = ['A','B','C','D','E','A','C','B','B']
DF1 = pd.DataFrame({'Team':DF})
DF2 = pd.DataFrame({'Team':['A','B','C','D','E'],'Rating':[1,2,3,4,5]})
I want to add a new column to DF1 as follows:
Team Rating
A 1
B 2
C 3
D 4
E 5
A 1
C 3
B 2
B 2
How can I add a new column?
I used
DF1['Rating']= np.where(DF1['Team']== DF2['Team'],DF2['Rating'],0)
Error : ValueError: Can only compare identically-labeled Series objects
Thanks
ZEP
I think you need map with a Series created by set_index. Teams without a match get NaN, so fillna is added to replace those with 0:
DF1['Rating']= DF1['Team'].map(DF2.set_index('Team')['Rating']).fillna(0)
print (DF1)
Team Rating
0 A 1
1 B 2
2 C 3
3 D 4
4 E 5
5 A 1
6 C 3
7 B 2
8 B 2
DF = ['A','B','C','D','E','A','C','B','B', 'G']
DF1 = pd.DataFrame({'Team':DF})
DF2 = pd.DataFrame({'Team':['A','B','C','D','E'],'Rating':[1,2,3,4,5]})
DF1['Rating']= DF1['Team'].map(DF2.set_index('Team')['Rating']).fillna(0)
print (DF1)
Team Rating
0 A 1.0
1 B 2.0
2 C 3.0
3 D 4.0
4 E 5.0
5 A 1.0
6 C 3.0
7 B 2.0
8 B 2.0
9 G 0.0 <- G not in DF2['Team']
Detail:
print (DF1['Team'].map(DF2.set_index('Team')['Rating']))
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 1.0
6 3.0
7 2.0
8 2.0
9 NaN
Name: Team, dtype: float64
You can use:
In [54]: DF1['new_col'] = DF1.Team.map(DF2.set_index('Team').Rating)
In [55]: DF1
Out[55]:
Team new_col
0 A 1
1 B 2
2 C 3
3 D 4
4 E 5
5 A 1
6 C 3
7 B 2
8 B 2
I think you can use pd.merge:
DF1=pd.merge(DF1,DF2,how='left',on='Team')
DF1
Team Rating
0 A 1
1 B 2
2 C 3
3 D 4
4 E 5
5 A 1
6 C 3
7 B 2
8 B 2
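For reference, the ValueError in the question comes from comparing two Series whose indexes differ (9 rows against 5). The sketch below reproduces it on the question's data and then applies the map-based lookup suggested in the answers; the names raised and lookup are illustrative.

```python
import pandas as pd

DF1 = pd.DataFrame({'Team': ['A', 'B', 'C', 'D', 'E', 'A', 'C', 'B', 'B']})
DF2 = pd.DataFrame({'Team': ['A', 'B', 'C', 'D', 'E'], 'Rating': [1, 2, 3, 4, 5]})

# Element-wise == requires both Series to carry identical index labels.
# DF1 has 9 rows and DF2 has 5, so pandas refuses to compare them.
try:
    DF1['Team'] == DF2['Team']
    raised = False
except ValueError as err:
    raised = True
    print(err)

# A lookup Series keyed by team name matches on values instead of row position.
lookup = DF2.set_index('Team')['Rating']
DF1['Rating'] = DF1['Team'].map(lookup)
print(DF1)
```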

Pandas dataframe: Number of dates in group prior to row date

I would like to add a column in a dataframe that contains for each group G the number of distinct observations in variable x that happened before time t.
Note: t is in datetime format and missing values in the data are possible but can be ignored. The same x can appear multiple times in a group but then it is assigned the same date. The time assigned to x is not the same across groups.
I hope this example helps:
Input:
Group x t
1 a 2013-11-01
1 b 2015-04-03
1 b 2015-04-03
1 c NaT
2 a 2017-03-01
2 c 2013-11-06
2 d 2015-04-26
2 d 2015-04-26
2 d 2015-04-26
2 b NaT
Output:
Group x t Number of unique x before time t
1 a 2013-11-01 0
1 b 2015-04-03 1
1 b 2015-04-03 1
1 c NaT NaN
2 a 2017-03-01 2
2 c 2013-11-06 0
2 d 2015-04-26 1
2 d 2015-04-26 1
2 d 2015-04-26 1
2 b NaT NaN
The dataset is quite large, so I wonder if there is a vectorized way to do this (e.g. using groupby).
Many Thanks
Here's another method.
The initial sort makes it so fillna will work later on.
Create df2, which calculates the unique number of days within each group before that date.
Merge the number of days back to the original df. fillna then takes care of the days which were duplicated (the sort ensures this happens properly)
Dates with NaT were placed at the end for the cumsum so just reset them to NaN
If you want to reorder at the end to the original order, just sort the index df.sort_index(inplace=True)
import pandas as pd
import numpy as np
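# Convert first so the sort is chronological and NaT rows land at the end of each group
df['t'] = pd.to_datetime(df.t)
df = df.sort_values(by=['Group', 't'])
# Work on a copy of the rows that have a date (avoids SettingWithCopyWarning)
df2 = df[df.t.notnull()].copy()
df2 = df2.drop_duplicates()
df2['temp'] = 1
df2['num_b4'] = df2.groupby('Group').temp.cumsum() - 1
df = df.merge(df2[['num_b4']], left_index=True, right_index=True, how='left')
# Duplicated rows were dropped from df2, so fill them from the row above
df['num_b4'] = df['num_b4'].ffill()
df.loc[df.t.isnull(), 'num_b4'] = np.nan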
# Group x t num_b4
#0 1 a 2013-11-01 0.0
#1 1 b 2015-04-03 1.0
#2 1 b 2015-04-03 1.0
#3 1 c NaT NaN
#5 2 c 2013-11-06 0.0
#6 2 d 2015-04-26 1.0
#7 2 d 2015-04-26 1.0
#8 2 d 2015-04-26 1.0
#4 2 a 2017-03-01 2.0
#9 2 b NaT NaN
IIUC, for the new cases you want to change a single line in the above code.
# df2 = df2.drop_duplicates()
df2 = df2.drop_duplicates(['Group', 't'])
With that, the same day that has multiple x values assigned to it does not cause the number of observations to increment. See the output for Group 3 below, in which I added 4 rows to your initial data.
Group x t
3 a 2015-04-03
3 b 2015-04-03
3 c 2015-04-03
3 c 2015-04-04
## Apply the Code changing the drop_duplicates() line
Group x t num_b4
0 1 a 2013-11-01 0.0
1 1 b 2015-04-03 1.0
2 1 b 2015-04-03 1.0
3 1 c NaT NaN
5 2 c 2013-11-06 0.0
6 2 d 2015-04-26 1.0
7 2 d 2015-04-26 1.0
8 2 d 2015-04-26 1.0
4 2 a 2017-03-01 2.0
9 2 b NaT NaN
10 3 a 2015-04-03 0.0
11 3 b 2015-04-03 0.0
12 3 c 2015-04-03 0.0
13 3 c 2015-04-04 1.0
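The variant that deduplicates on ['Group', 't'] counts distinct dates before t, which can likely also be expressed as a dense rank within each group. This is a sketch on the question's original data, not the answer's own method; it matches the expected output only under the question's assumption that each x has a single date per group, and the column name num_b4 is reused from above.

```python
import pandas as pd

df = pd.DataFrame({
    'Group': [1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    'x': list('abbcacdddb'),
    't': pd.to_datetime(['2013-11-01', '2015-04-03', '2015-04-03', None,
                         '2017-03-01', '2013-11-06', '2015-04-26',
                         '2015-04-26', '2015-04-26', None]),
})

# Dense rank gives 1 for the earliest date in a group, 2 for the next distinct
# date, and so on; subtracting 1 yields the number of distinct earlier dates.
# NaT rows keep NaN because rank() leaves missing values unranked by default.
df['num_b4'] = df.groupby('Group')['t'].rank(method='dense') - 1
print(df)
```

No sorting or merge is needed here, so the original row order is preserved.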
You can do it with a custom function that uses merge for a self-join, then groupby and nunique to count unique values:
def countunique(x):
    df_out = x.merge(x, on='Group')\
              .query('x_x != x_y and t_y < t_x')\
              .groupby(['x_x','t_x'])['x_y'].nunique()\
              .reset_index()
    result = x.merge(df_out, left_on=['x','t'],
                     right_on=['x_x','t_x'],
                     how='left')
    result = result[['Group','x','t','x_y']]
    result.loc[result.t.notnull(),'x_y'] = result.loc[result.t.notnull(),'x_y'].fillna(0)
    return result.rename(columns={'x_y':'No of unique x before t'})
df.groupby('Group', group_keys=False).apply(countunique)
Output:
Group x t No of unique x before t
0 1 a 2013-11-01 0.0
1 1 b 2015-04-03 1.0
2 1 b 2015-04-03 1.0
3 1 c NaT NaN
0 2 a 2017-03-01 2.0
1 2 c 2013-11-06 0.0
2 2 d 2015-04-26 1.0
3 2 d 2015-04-26 1.0
4 2 d 2015-04-26 1.0
5 2 b NaT NaN
Explanation:
For each group:
Perform a self-join using merge on 'Group'.
Filter the result of the self-join, keeping only rows whose time is before the current record's.
Use groupby with nunique to count only the unique values of x from the self-join.
Merge the count of x back to the original dataframe, keeping all rows with how='left'.
Fill NaN values with zero where the record has a time.
Rename the column headings.

Multiple columns difference of 2 Pandas DataFrame

I am new to Python and Pandas; can someone help me with the report below?
I want to report the difference of N columns and create new columns holding the difference values. Is it possible to make this dynamic, since I have more than 30 columns? (The columns are fixed; row values can change.)
A and B can be alphanumeric.
Use join with sub for difference of DataFrames:
#if columns are strings, first cast it
df1 = df1.astype(int)
df2 = df2.astype(int)
#if first columns are not indices
#df1 = df1.set_index('ID')
#df2 = df2.set_index('ID')
df = df1.join(df2.sub(df1).add_prefix('sum'))
print (df)
A B sumA sumB
ID
0 10 2.0 5 3.0
1 11 3.0 6 5.0
2 12 4.0 7 5.0
Or similar:
df = df1.join(df2.sub(df1), rsuffix='sum')
print (df)
A B Asum Bsum
ID
0 10 2.0 5 3.0
1 11 3.0 6 5.0
2 12 4.0 7 5.0
Detail:
print (df2.sub(df1))
A B
ID
0 5 3.0
1 6 5.0
2 7 5.0
IIUC
df1[['C','D']]=(df2-df1)[['A','B']]
df1
Out[868]:
ID A B C D
0 0 10 2.0 5 3.0
1 1 11 3.0 6 5.0
2 2 12 4.0 7 5.0
df1.assign(B=0)
Out[869]:
ID A B C D
0 0 10 0 5 3.0
1 1 11 0 6 5.0
2 2 12 0 7 5.0
The 'ID' column should really be an index. See the Pandas tutorial on indexing for why this is a good idea.
df1 = df1.set_index('ID')
df2 = df2.set_index('ID')
df = df1.copy()
df[['C', 'D']] = df2 - df1
df['B'] = 0
print(df)
outputs
A B C D
ID
0 10 0 5 3.0
1 11 0 6 5.0
2 12 0 7 5.0
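For the "dynamic" part of the question, the subtraction already covers every shared column, so no column names need to be listed. The sketch below uses small frames whose values are assumptions reconstructed from the outputs above (the question's original data is not shown), and the _diff suffix is an illustrative naming choice.

```python
import pandas as pd

# Assumed sample data, chosen to reproduce the differences shown in the answers.
df1 = pd.DataFrame({'ID': [0, 1, 2], 'A': [10, 11, 12], 'B': [2.0, 3.0, 4.0]}).set_index('ID')
df2 = pd.DataFrame({'ID': [0, 1, 2], 'A': [15, 17, 19], 'B': [5.0, 8.0, 9.0]}).set_index('ID')

# sub() aligns on both index and column labels, so this works for any number
# of shared columns; add_suffix() names the new columns after their source.
diff = df2.sub(df1).add_suffix('_diff')

# Place the original columns and their differences side by side.
report = pd.concat([df1, diff], axis=1)
print(report)
```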

pandas equivalent of R's cbind (concatenate/stack vectors vertically)

suppose I have two dataframes:
import pandas
....
....
test1 = pandas.DataFrame([1,2,3,4,5])
....
....
test2 = pandas.DataFrame([4,2,1,3,7])
....
I tried test1.append(test2) but it is the equivalent of R's rbind.
How can I combine the two as two columns of a dataframe similar to the cbind function in R?
test3 = pd.concat([test1, test2], axis=1)
test3.columns = ['a','b']
(But see the detailed answer by @feng-mai, below)
There is a key difference between concat(axis = 1) in pandas and cbind() in R:
concat attempts to merge/align by index. There is no concept of an index in an R data frame. If the indexes of the two pandas dataframes are misaligned, the result differs from cbind (even if they have the same number of rows). You need to either make sure the indexes align or drop/reset them.
Example:
import pandas as pd
test1 = pd.DataFrame([1,2,3,4,5])
test1.index = ['a','b','c','d','e']
test2 = pd.DataFrame([4,2,1,3,7])
test2.index = ['d','e','f','g','h']
pd.concat([test1, test2], axis=1)
0 0
a 1.0 NaN
b 2.0 NaN
c 3.0 NaN
d 4.0 4.0
e 5.0 2.0
f NaN 1.0
g NaN 3.0
h NaN 7.0
pd.concat([test1.reset_index(drop=True), test2.reset_index(drop=True)], axis=1)
0 0
0 1 4
1 2 2
2 3 1
3 4 3
4 5 7
pd.concat([test1.reset_index(), test2.reset_index(drop=True)], axis=1)
index 0 0
0 a 1 4
1 b 2 2
2 c 3 1
3 d 4 3
4 e 5 7
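If positional, cbind-like behaviour is needed often, the reset-then-concat pattern above could be wrapped in a small helper. The function name cbind below is hypothetical (pandas has no such function), and the columns are named 'a' and 'b' here only to avoid duplicate column labels.

```python
import pandas as pd

def cbind(*frames):
    """Hypothetical helper: place frames side by side by row position,
    discarding their index labels so nothing is aligned or padded with NaN."""
    return pd.concat([f.reset_index(drop=True) for f in frames], axis=1)

test1 = pd.DataFrame({'a': [1, 2, 3, 4, 5]}, index=['a', 'b', 'c', 'd', 'e'])
test2 = pd.DataFrame({'b': [4, 2, 1, 3, 7]}, index=['d', 'e', 'f', 'g', 'h'])

test3 = cbind(test1, test2)
print(test3)
```

Note that this silently pairs rows by position, so it is only appropriate when the frames have the same length and the same row order.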
