Taking different records from groups using group by in pandas - python-3.x

Suppose I have a dataframe like this:
>>> df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4],'value':[1,2,3,1,2,3,4,1,1]})
>>> df
id value
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
6 2 4
7 3 1
8 4 1
Now I want to take all records from each group (grouping by id) except the last 3. That is, I want to drop the last 3 records from every group. How can I do it with a pandas groupby? This is dummy data.

Use GroupBy.cumcount with ascending=False to count rows from the end of each group, then keep only rows whose counter is greater than 2 with Series.gt (2 rather than 3, because the counter starts at 0):
df = df[df.groupby('id').cumcount(ascending=False).gt(2)]
print (df)
id value
3 2 1
Details:
print (df.groupby('id').cumcount(ascending=False))
0 2
1 1
2 0
3 3
4 2
5 1
6 0
7 0
8 0
dtype: int64
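For reference, an equivalent (though usually slower) way to express "drop the last 3 rows of every group" is to slice each group with GroupBy.apply; a minimal sketch on the same dummy data, not the answer's original code:

import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})

# iloc[:-3] drops the last 3 rows of each group; groups with 3 or fewer rows become empty.
out = df.groupby('id', group_keys=False).apply(lambda g: g.iloc[:-3])
print(out)
#    id  value
# 3   2      1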

Related

How to calculate the having statement in pandas dataframe [duplicate]

I'm using groupby on a pandas dataframe to drop all rows that don't have the minimum of a specific column. Something like this:
df1 = df.groupby("item", as_index=False)["diff"].min()
However, if I have more than those two columns, the other columns (e.g. otherstuff in my example) get dropped. Can I keep those columns using groupby, or am I going to have to find a different way to drop the rows?
My data looks like:
item diff otherstuff
0 1 2 1
1 1 1 2
2 1 3 7
3 2 -1 0
4 2 1 3
5 2 4 9
6 2 -6 2
7 3 0 0
8 3 2 9
and should end up like:
item diff otherstuff
0 1 1 2
1 2 -6 2
2 3 0 0
but what I'm getting is:
item diff
0 1 1
1 2 -6
2 3 0
I've been looking through the documentation and can't find anything. I tried:
df1 = df.groupby(["item", "otherstuff"], as_index=false)["diff"].min()
df1 = df.groupby("item", as_index=false)["diff"].min()["otherstuff"]
df1 = df.groupby("item", as_index=false)["otherstuff", "diff"].min()
But none of those work (I realized with the last one that the syntax is meant for aggregating after a group is created).
Method #1: use idxmin() to get the indices of the elements of minimum diff, and then select those:
>>> df.loc[df.groupby("item")["diff"].idxmin()]
item diff otherstuff
1 1 1 2
6 2 -6 2
7 3 0 0
[3 rows x 3 columns]
Method #2: sort by diff, and then take the first element in each item group:
>>> df.sort_values("diff").groupby("item", as_index=False).first()
item diff otherstuff
0 1 1 2
1 2 -6 2
2 3 0 0
[3 rows x 3 columns]
Note that the resulting indices are different even though the row content is the same.
You can use DataFrame.sort_values with DataFrame.drop_duplicates:
df = df.sort_values(by='diff').drop_duplicates(subset='item')
print (df)
item diff otherstuff
6 2 -6 2
7 3 0 0
1 1 1 2
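Note that the result above is ordered by diff rather than by the original index; if the original row order matters, one option (a small sketch, assuming the same df) is to sort the index afterwards:

df1 = (df.sort_values(by='diff')
         .drop_duplicates(subset='item')
         .sort_index())
print(df1)
#    item  diff  otherstuff
# 1     1     1           2
# 6     2    -6           2
# 7     3     0           0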
If there can be multiple minimal values per group and you want all of the min rows, use boolean indexing with transform to get the minimal value per group:
print (df)
item diff otherstuff
0 1 2 1
1 1 1 2 <-multiple min
2 1 1 7 <-multiple min
3 2 -1 0
4 2 1 3
5 2 4 9
6 2 -6 2
7 3 0 0
8 3 2 9
print (df.groupby("item")["diff"].transform('min'))
0 1
1 1
2 1
3 -6
4 -6
5 -6
6 -6
7 0
8 0
Name: diff, dtype: int64
df = df[df.groupby("item")["diff"].transform('min') == df['diff']]
print (df)
item diff otherstuff
1 1 1 2
2 1 1 7
6 2 -6 2
7 3 0 0
The above answer works great if there is (or you want) only one min. In my case there could be multiple mins and I wanted all rows equal to the min, which .idxmin() doesn't give you. This worked:
def filter_group(dfg, col):
    return dfg[dfg[col] == dfg[col].min()]

df = pd.DataFrame({'g': ['a'] * 6 + ['b'] * 6, 'v1': (list(range(3)) + list(range(3))) * 2, 'v2': range(12)})
df.groupby('g', group_keys=False).apply(lambda x: filter_group(x, 'v1'))
As an aside, .filter() is also relevant to this question but didn't work for me.
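A plausible reason .filter() didn't work here is that GroupBy.filter keeps or drops whole groups based on a predicate, so it cannot pick out individual min rows within a group. A small illustrative sketch on the df defined just above (not the original poster's code):

# Keeps every group whose minimum v1 is 0 -- here that is both groups, i.e. all 12 rows.
kept = df.groupby('g').filter(lambda grp: grp['v1'].min() == 0)

# Row-level selection needs apply (as above) or a transform-based mask:
min_rows = df[df.groupby('g')['v1'].transform('min') == df['v1']]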
I tried everyone's method and I couldn't get it to work properly. Instead I did the process step-by-step and ended up with the correct result.
df.sort_values(by='item', inplace=True, ignore_index=True)
df.drop_duplicates(subset='diff', inplace=True, ignore_index=True)
df.sort_values(by=['diff'], inplace=True, ignore_index=True)
For a little more explanation:
Sort items by the minimum value you want
Drop the duplicates of the column you want to sort with
Resort the data because the data is still sorted by the minimum values
If you know that all of your "items" have more than one record, you can sort and then use duplicated:
df.sort_values(by='diff').duplicated(subset='item', keep='first')
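As written, duplicated() only returns a boolean mask; to actually keep one minimum row per item you would presumably negate it and index with it, along these lines (a sketch, assuming the same df as in the question):

s = df.sort_values(by='diff')
df1 = s[~s.duplicated(subset='item', keep='first')]
print(df1)
#    item  diff  otherstuff
# 6     2    -6           2
# 7     3     0           0
# 1     1     1           2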


Calculation using shifting is not working in a for loop

The problem consists of calculating, in a dataframe, the column "accumulated" from the columns "accumulated" and "weekly". The formula is: accumulated at t = weekly at t + accumulated at t-1.
The desired result should be:
weekly accumulated
2 0
1 1
4 5
2 7
The result I'm obtaining is:
weekly accumulated
2 0
1 1
4 4
2 2
What I have tried is:
for key, value in df_dic.items():
    df_aux = df_dic[key]
    df_aux['accumulated'] = 0
    df_aux['accumulated'] = (df_aux.weekly + df_aux.accumulated.shift(1))
    #df_aux["accumulated"] = df_aux.iloc[:,2] + df_aux.iloc[:,3].shift(1)
    df_aux.iloc[0,3] = 0  # I put this because I want to force the first cell to be 0.
Here df_aux.iloc[0,3] is the first row of the column "accumulated".
What am I doing wrong?
Thank you
EDIT: df_dic is a dictionary with 5 dataframes; it looks like {0: df1, 1: df2, 2: df3}. All the dataframes have the same size and the same column names, so I use the for loop to run the same calculation on every dataframe in the dictionary.
EDIT 2: I tried doing the computation outside the for loop and it still doesn't work.
What I'm doing is:
df_auxp = df_dic[0]
df_auxp['accumulated'] = 0
df_auxp['accumulated'] = df_auxp["weekly"] + df_auxp["accumulated"].shift(1)
df_auxp.iloc[0,3] = df_auxp.iloc[0,3].fillna(0)
Maybe it has something to do with the dictionary interaction...
To solve for 3 dataframes
import pandas as pd
df1 = pd.DataFrame({'weekly':[2,1,4,2]})
df2 = pd.DataFrame({'weekly':[3,2,5,3]})
df3 = pd.DataFrame({'weekly':[4,3,6,4]})
print (df1)
print (df2)
print (df3)
for d in [df1, df2, df3]:
    d['accumulated'] = d['weekly'].cumsum() - d.iloc[0,0]
    print (d)
The output of this will be as follows:
Original dataframes:
df1
weekly
0 2
1 1
2 4
3 2
df2
weekly
0 3
1 2
2 5
3 3
df3
weekly
0 4
1 3
2 6
3 4
Updated dataframes:
df1:
weekly accumulated
0 2 0
1 1 1
2 4 5
3 2 7
df2:
weekly accumulated
0 3 0
1 2 2
2 5 7
3 3 10
df3:
weekly accumulated
0 4 0
1 3 3
2 6 9
3 4 13
To solve for 1 dataframe
You need to use cumsum and then subtract the value of the first row. That will give you the desired result. Here's how to do it:
import pandas as pd
df = pd.DataFrame({'weekly':[2,1,4,2]})
print (df)
df['accumulated'] = df['weekly'].cumsum() - df.iloc[0,0]
print (df)
Original dataframe:
weekly
0 2
1 1
2 4
3 2
Updated dataframe:
weekly accumulated
0 2 0
1 1 1
2 4 5
3 2 7
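A brief aside on why the original shift-based loop cannot work: a single vectorized assignment sees accumulated as it is before the assignment (all zeros here), so the recursion accumulated[t] = weekly[t] + accumulated[t-1] never feeds back into itself; cumsum is the vectorized form of that recursion. A small sketch, assuming the same single-dataframe example:

import pandas as pd

df = pd.DataFrame({'weekly': [2, 1, 4, 2]})
df['accumulated'] = 0

# shift(1) shifts the zeros that 'accumulated' currently holds, not the values being
# computed, so this is just weekly plus zeros: NaN, 1, 4, 2 -- not a running total.
broken = df['weekly'] + df['accumulated'].shift(1)

# The recursion acc[t] = weekly[t] + acc[t-1], with acc[0] forced to 0, is the
# running sum minus the first weekly value:
df['accumulated'] = df['weekly'].cumsum() - df.iloc[0, 0]
print(df)
#    weekly  accumulated
# 0       2            0
# 1       1            1
# 2       4            5
# 3       2            7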

count Total rows of an Id from another column

I have a dataframe
# Initialise data of lists.
data = {'Id': ['1','2','3','4','5','6','7','8','9','10'], 'reply_id': [2,2,2,5,5,6,8,8,1,1]}
# Create DataFrame
df = pd.DataFrame(data)
Id reply_id
0 1 2
1 2 2
2 3 2
3 4 5
4 5 5
5 6 6
6 7 8
7 8 8
8 9 1
9 10 1
I want to get, in a new column new, the total number of times each Id occurs in reply_id.
Id=1 has 2 occurrences in reply_id, which is the count I want in the new column new.
Desired output
Id reply_id new
0 1 2 2
1 2 2 3
2 3 2 0
3 4 5 0
4 5 5 2
5 6 6 1
6 7 8 0
7 8 8 2
8 9 1 0
9 10 1 0
I have tried this line of code:
df['new'] = df.reply_id.eq(df.Id).astype(int).groupby(df.Id).transform('sum')
In this answer, I used Series.value_counts to count the values in reply_id and converted the result to a dict. Then I used Series.map on the Id column to associate the counts with each Id. fillna(0) fills in the Ids that never appear in reply_id:
df['new'] = (df['Id']
             .astype(int)
             .map(df['reply_id'].value_counts().to_dict())
             .fillna(0)
             .astype(int))
Use Series.groupby on the column reply_id, then use the aggregation function GroupBy.count to create a mapping series counts; finally, use Series.map to map the values in the Id column to their respective counts:
counts = df['reply_id'].groupby(df['reply_id']).count()
# Id is stored as strings in the sample data, so cast it before mapping
df['new'] = df['Id'].astype(int).map(counts).fillna(0).astype(int)
Result:
# print(df)
Id reply_id new
0 1 2 2
1 2 2 3
2 3 2 0
3 4 5 0
4 5 5 2
5 6 6 1
6 7 8 0
7 8 8 2
8 9 1 0
9 10 1 0
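One detail worth noting, assuming the sample data above: Id is stored as strings while reply_id holds integers, so both mapping answers rely on casting Id with astype(int) before the lookup; without the cast, map finds no matches and fillna(0) turns every count into 0. A quick check:

counts = df['reply_id'].value_counts()

print(df['Id'].map(counts).isna().all())   # True: the string '1' never matches the int 1
print(df['Id'].astype(int).map(counts).fillna(0).astype(int).tolist())
# [2, 3, 0, 0, 2, 1, 0, 2, 0, 0]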

How to check value change in column

My dataframe has three columns: value, ID and distance. I want to detect, in the ID column, when the value changes from 2 to any other value: for each such run of non-2 values, count the rows, record the first and last value, and also save the corresponding first value of the distance column.
df=pd.DataFrame({'value':[3,4,7,8,11,20,15,20,15,16],'ID':[2,2,8,8,8,2,2,2,5,5],'distance':[0,0,1,0,0,0,0,0,0,0]})
print(df)
value ID distance
0 3 2 0
1 4 2 0
2 7 8 1
3 8 8 0
4 11 8 0
5 20 2 0
6 15 2 0
7 20 2 0
8 15 5 0
9 16 5 0
required results:
df_out=pd.DataFrame({'rows_Count':[3,2],'value_first':[7,15],'value_last':[11,16],'distance_first':[1,0]})
print(df_out)
rows_Count value_first value_last distance_first
0 3 7 11 1
1 2 15 16 0
Use:
# compare with 2
m = df['ID'].eq(2)
# filter out data before the first 2 (does not happen in the sample data, but can in real data)
df = df[m.cumsum().ne(0)]
# create unique group labels for the non-2 runs, adding the missing rows back with reindex
s = m.ne(m.shift()).cumsum()[~m].reindex(df.index)
# aggregate with the helper Series s
df1 = df.groupby(s).agg({'ID':'size', 'value':['first','last'], 'distance':'first'})
# flatten the MultiIndex columns
df1.columns = df1.columns.map('_'.join)
df1 = df1.reset_index(drop=True)
print (df1)
ID_size value_first value_last distance_first
0 3 7 11 1
1 2 15 16 0
Verify on changed data (where the first group is no longer ID 2):
df = pd.DataFrame({'value': [3,4,7,8,11,20,15,20,15,16],
                   'ID': [1,7,8,8,8,2,2,2,5,5],
                   'distance': [0,0,1,0,0,0,0,0,0,0]})
print(df)
value ID distance
0 3 1 0 <- changed ID
1 4 7 0 <- changed ID
2 7 8 1
3 8 8 0
4 11 8 0
5 20 2 0
6 15 2 0
7 20 2 0
8 15 5 0
9 16 5 0
# compare with 2
m = df['ID'].eq(2)
# filter out data before the first 2 (does not happen in the sample data, but can in real data)
df = df[m.cumsum().ne(0)]
# create unique group labels for the non-2 runs, adding the missing rows back with reindex
s = m.ne(m.shift()).cumsum()[~m].reindex(df.index)
# aggregate with the helper Series s
df1 = df.groupby(s).agg({'ID':'size', 'value':['first','last'], 'distance':'first'})
# flatten the MultiIndex columns
df1.columns = df1.columns.map('_'.join)
df1 = df1.reset_index(drop=True)
print (df1)
ID_size value_first value_last distance_first
0 2 15 16 0
