I have a DataFrame with columns A, B, and C. For each value of A, I would like to select the row with the minimum value in column B.
That is, from this:
df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
'B': [4, 5, 2, 7, 4, 6],
'C': [3, 4, 10, 2, 4, 6]})
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
I would like to get:
A B C
0 1 2 10
1 2 4 4
For the moment I am grouping by column A, then creating a value that indicates to me the rows I will keep:
a = data.groupby('A').min()
a['A'] = a.index
to_keep = [str(x[0]) + str(x[1]) for x in a[['A', 'B']].values]
data['id'] = data['A'].astype(str) + data['B'].astype('str')
data[data['id'].isin(to_keep)]
I am sure that there is a much more straightforward way to do this.
I have seen many answers here that use MultiIndex, which I would prefer to avoid.
Thank you for your help.
I feel like you're overthinking this. Just use groupby and idxmin:
df.loc[df.groupby('A').B.idxmin()]
A B C
2 1 2 10
4 2 4 4
df.loc[df.groupby('A').B.idxmin()].reset_index(drop=True)
A B C
0 1 2 10
1 2 4 4
Had a similar situation but with a more complex column heading (e.g. "B val") in which case this is needed:
df.loc[df.groupby('A')['B val'].idxmin()]
The accepted answer (suggesting idxmin) cannot be used with the pipe pattern. A pipe-friendly alternative is to first sort values and then use groupby with DataFrame.head:
data.sort_values('B').groupby('A').apply(DataFrame.head, n=1)
This is possible because by default groupby preserves the order of rows within each group, which is stable and documented behaviour (see pandas.DataFrame.groupby).
This approach has additional benefits:
it can be easily expanded to select n rows with smallest values in specific column
it can break ties by providing another column (as a list) to .sort_values(), e.g.:
data.sort_values(['final_score', 'midterm_score']).groupby('year').apply(DataFrame.head, n=1)
As with other answers, to exactly match the result desired in the question .reset_index(drop=True) is needed, making the final snippet:
df.sort_values('B').groupby('A').apply(DataFrame.head, n=1).reset_index(drop=True)
I found an answer a little bit more wordy, but a lot more efficient:
This is the example dataset:
data = pd.DataFrame({'A': [1,1,1,2,2,2], 'B':[4,5,2,7,4,6], 'C':[3,4,10,2,4,6]})
data
Out:
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
First we will get the min values on a Series from a groupby operation:
min_value = data.groupby('A').B.min()
min_value
Out:
A
1 2
2 4
Name: B, dtype: int64
Then, we merge this series result on the original data frame
data = data.merge(min_value, on='A',suffixes=('', '_min'))
data
Out:
A B C B_min
0 1 4 3 2
1 1 5 4 2
2 1 2 10 2
3 2 7 2 4
4 2 4 4 4
5 2 6 6 4
Finally, we get only the lines where B is equal to B_min and drop B_min since we don't need it anymore.
data = data[data.B==data.B_min].drop('B_min', axis=1)
data
Out:
A B C
2 1 2 10
4 2 4 4
I have tested it on very large datasets and this was the only way I could make it work in a reasonable time.
You can sort_values and drop_duplicates:
df.sort_values('B').drop_duplicates('A')
Output:
A B C
2 1 2 10
4 2 4 4
The solution is, as written before ;
df.loc[df.groupby('A')['B'].idxmin()]
If the solution but then if you get an error;
"Passing list-likes to .loc or [] with any missing labels is no longer supported.
The following labels were missing: Float64Index([nan], dtype='float64').
See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"
In my case, there were 'NaN' values at column B. So, I used 'dropna()' then it worked.
df.loc[df.groupby('A')['B'].idxmin().dropna()]
You can also boolean indexing the rows where B column is minimal value
out = df[df['B'] == df.groupby('A')['B'].transform('min')]
print(out)
A B C
2 1 2 10
4 2 4 4
import pandas as pd
df1 = pd.DataFrame({"A":[14, 4, 5, 4],"B":[1,2,3,4]})
df2 = pd.DataFrame({"A":[14, 4, 5, 4],"C":[5,6,7,8]})
df = pd.concat([df1,df2],axis=1)
Let's see the concated df,the first column and third column shares the same column name A.
df
A B A C
0 14 1 14 5
1 4 2 4 6
2 5 3 5 7
3 4 4 4 8
I want to get the following format.
df
A B C
0 14 1 5
1 4 2 6
2 5 3 7
3 4 4 8
Drop column by id.
result = df.drop(df.columns[2],axis=1)
result
B C
0 1 5
1 2 6
2 3 7
3 4 8
I can get what i expect this way:
import pandas as pd
df1 = pd.DataFrame({"A":[14, 4, 5, 4],"B":[1,2,3,4]})
df2 = pd.DataFrame({"A":[14, 4, 5, 4],"C":[5,6,7,8]})
df2 = df2.drop(df2.columns[0],axis=1)
df = pd.concat([df1,df2],axis=1)
It is so strange that both the first and third column removed when to drop specified column by id.
1.Please tell me the reason of dataframe's this action.
2.How can i remove the third column at the same time keep the first column undeleted?
Here's a way using indexes:
index_to_drop = 2
# get indexes to keep
col_idxs = [en for en, _ in enumerate(df.columns) if en != index_to_drop]
# subset the df
df = df.iloc[:,col_idxs]
A B C
0 14 1 5
1 4 2 6
2 5 3 7
3 4 4 8
How can I perform aggregation with Pandas?
No DataFrame after aggregation! What happened?
How can I aggregate mainly strings columns (to lists, tuples, strings with separator)?
How can I aggregate counts?
How can I create a new column filled by aggregated values?
I've seen these recurring questions asking about various faces of the pandas aggregate functionality.
Most of the information regarding aggregation and its various use cases today is fragmented across dozens of badly worded, unsearchable posts.
The aim here is to collate some of the more important points for posterity.
This Q&A is meant to be the next instalment in a series of helpful user-guides:
How to pivot a dataframe,
Pandas concat
How do I operate on a DataFrame with a Series for every column?
Pandas Merging 101
Please note that this post is not meant to be a replacement for the documentation about aggregation and about groupby, so please read that as well!
Question 1
How can I perform aggregation with Pandas?
Expanded aggregation documentation.
Aggregating functions are the ones that reduce the dimension of the returned objects. It means output Series/DataFrame have less or same rows like original.
Some common aggregating functions are tabulated below:
Function Description
mean() Compute mean of groups
sum() Compute sum of group values
size() Compute group sizes
count() Compute count of group
std() Standard deviation of groups
var() Compute variance of groups
sem() Standard error of the mean of groups
describe() Generates descriptive statistics
first() Compute first of group values
last() Compute last of group values
nth() Take nth value, or a subset if n is a list
min() Compute min of group values
max() Compute max of group values
np.random.seed(123)
df = pd.DataFrame({'A' : ['foo', 'foo', 'bar', 'foo', 'bar', 'foo'],
'B' : ['one', 'two', 'three','two', 'two', 'one'],
'C' : np.random.randint(5, size=6),
'D' : np.random.randint(5, size=6),
'E' : np.random.randint(5, size=6)})
print (df)
A B C D E
0 foo one 2 3 0
1 foo two 4 1 0
2 bar three 2 1 1
3 foo two 1 0 3
4 bar two 3 1 4
5 foo one 2 1 0
Aggregation by filtered columns and Cython implemented functions:
df1 = df.groupby(['A', 'B'], as_index=False)['C'].sum()
print (df1)
A B C
0 bar three 2
1 bar two 3
2 foo one 4
3 foo two 5
An aggregate function is used for all columns without being specified in the groupby function, here the A, B columns:
df2 = df.groupby(['A', 'B'], as_index=False).sum()
print (df2)
A B C D E
0 bar three 2 1 1
1 bar two 3 1 4
2 foo one 4 4 0
3 foo two 5 1 3
You can also specify only some columns used for aggregation in a list after the groupby function:
df3 = df.groupby(['A', 'B'], as_index=False)['C','D'].sum()
print (df3)
A B C D
0 bar three 2 1
1 bar two 3 1
2 foo one 4 4
3 foo two 5 1
Same results by using function DataFrameGroupBy.agg:
df1 = df.groupby(['A', 'B'], as_index=False)['C'].agg('sum')
print (df1)
A B C
0 bar three 2
1 bar two 3
2 foo one 4
3 foo two 5
df2 = df.groupby(['A', 'B'], as_index=False).agg('sum')
print (df2)
A B C D E
0 bar three 2 1 1
1 bar two 3 1 4
2 foo one 4 4 0
3 foo two 5 1 3
For multiple functions applied for one column use a list of tuples - names of new columns and aggregated functions:
df4 = (df.groupby(['A', 'B'])['C']
.agg([('average','mean'),('total','sum')])
.reset_index())
print (df4)
A B average total
0 bar three 2.0 2
1 bar two 3.0 3
2 foo one 2.0 4
3 foo two 2.5 5
If want to pass multiple functions is possible pass list of tuples:
df5 = (df.groupby(['A', 'B'])
.agg([('average','mean'),('total','sum')]))
print (df5)
C D E
average total average total average total
A B
bar three 2.0 2 1.0 1 1.0 1
two 3.0 3 1.0 1 4.0 4
foo one 2.0 4 2.0 4 0.0 0
two 2.5 5 0.5 1 1.5 3
Then get MultiIndex in columns:
print (df5.columns)
MultiIndex(levels=[['C', 'D', 'E'], ['average', 'total']],
labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])
And for converting to columns, flattening MultiIndex use map with join:
df5.columns = df5.columns.map('_'.join)
df5 = df5.reset_index()
print (df5)
A B C_average C_total D_average D_total E_average E_total
0 bar three 2.0 2 1.0 1 1.0 1
1 bar two 3.0 3 1.0 1 4.0 4
2 foo one 2.0 4 2.0 4 0.0 0
3 foo two 2.5 5 0.5 1 1.5 3
Another solution is pass list of aggregate functions, then flatten MultiIndex and for another columns names use str.replace:
df5 = df.groupby(['A', 'B']).agg(['mean','sum'])
df5.columns = (df5.columns.map('_'.join)
.str.replace('sum','total')
.str.replace('mean','average'))
df5 = df5.reset_index()
print (df5)
A B C_average C_total D_average D_total E_average E_total
0 bar three 2.0 2 1.0 1 1.0 1
1 bar two 3.0 3 1.0 1 4.0 4
2 foo one 2.0 4 2.0 4 0.0 0
3 foo two 2.5 5 0.5 1 1.5 3
If want specified each column with aggregated function separately pass dictionary:
df6 = (df.groupby(['A', 'B'], as_index=False)
.agg({'C':'sum','D':'mean'})
.rename(columns={'C':'C_total', 'D':'D_average'}))
print (df6)
A B C_total D_average
0 bar three 2 1.0
1 bar two 3 1.0
2 foo one 4 2.0
3 foo two 5 0.5
You can pass custom function too:
def func(x):
return x.iat[0] + x.iat[-1]
df7 = (df.groupby(['A', 'B'], as_index=False)
.agg({'C':'sum','D': func})
.rename(columns={'C':'C_total', 'D':'D_sum_first_and_last'}))
print (df7)
A B C_total D_sum_first_and_last
0 bar three 2 2
1 bar two 3 2
2 foo one 4 4
3 foo two 5 1
Question 2
No DataFrame after aggregation! What happened?
Aggregation by two or more columns:
df1 = df.groupby(['A', 'B'])['C'].sum()
print (df1)
A B
bar three 2
two 3
foo one 4
two 5
Name: C, dtype: int32
First check the Index and type of a Pandas object:
print (df1.index)
MultiIndex(levels=[['bar', 'foo'], ['one', 'three', 'two']],
labels=[[0, 0, 1, 1], [1, 2, 0, 2]],
names=['A', 'B'])
print (type(df1))
<class 'pandas.core.series.Series'>
There are two solutions for how to get MultiIndex Series to columns:
add parameter as_index=False
df1 = df.groupby(['A', 'B'], as_index=False)['C'].sum()
print (df1)
A B C
0 bar three 2
1 bar two 3
2 foo one 4
3 foo two 5
use Series.reset_index:
df1 = df.groupby(['A', 'B'])['C'].sum().reset_index()
print (df1)
A B C
0 bar three 2
1 bar two 3
2 foo one 4
3 foo two 5
If group by one column:
df2 = df.groupby('A')['C'].sum()
print (df2)
A
bar 5
foo 9
Name: C, dtype: int32
... get Series with Index:
print (df2.index)
Index(['bar', 'foo'], dtype='object', name='A')
print (type(df2))
<class 'pandas.core.series.Series'>
And the solution is the same like in the MultiIndex Series:
df2 = df.groupby('A', as_index=False)['C'].sum()
print (df2)
A C
0 bar 5
1 foo 9
df2 = df.groupby('A')['C'].sum().reset_index()
print (df2)
A C
0 bar 5
1 foo 9
Question 3
How can I aggregate mainly strings columns (to lists, tuples, strings with separator)?
df = pd.DataFrame({'A' : ['a', 'c', 'b', 'b', 'a', 'c', 'b'],
'B' : ['one', 'two', 'three','two', 'two', 'one', 'three'],
'C' : ['three', 'one', 'two', 'two', 'three','two', 'one'],
'D' : [1,2,3,2,3,1,2]})
print (df)
A B C D
0 a one three 1
1 c two one 2
2 b three two 3
3 b two two 2
4 a two three 3
5 c one two 1
6 b three one 2
Instead of an aggregation function, it is possible to pass list, tuple, set for converting the column:
df1 = df.groupby('A')['B'].agg(list).reset_index()
print (df1)
A B
0 a [one, two]
1 b [three, two, three]
2 c [two, one]
An alternative is use GroupBy.apply:
df1 = df.groupby('A')['B'].apply(list).reset_index()
print (df1)
A B
0 a [one, two]
1 b [three, two, three]
2 c [two, one]
For converting to strings with a separator, use .join only if it is a string column:
df2 = df.groupby('A')['B'].agg(','.join).reset_index()
print (df2)
A B
0 a one,two
1 b three,two,three
2 c two,one
If it is a numeric column, use a lambda function with astype for converting to strings:
df3 = (df.groupby('A')['D']
.agg(lambda x: ','.join(x.astype(str)))
.reset_index())
print (df3)
A D
0 a 1,3
1 b 3,2,2
2 c 2,1
Another solution is converting to strings before groupby:
df3 = (df.assign(D = df['D'].astype(str))
.groupby('A')['D']
.agg(','.join).reset_index())
print (df3)
A D
0 a 1,3
1 b 3,2,2
2 c 2,1
For converting all columns, don't pass a list of column(s) after groupby.
There isn't any column D, because automatic exclusion of 'nuisance' columns. It means all numeric columns are excluded.
df4 = df.groupby('A').agg(','.join).reset_index()
print (df4)
A B C
0 a one,two three,three
1 b three,two,three two,two,one
2 c two,one one,two
So it's necessary to convert all columns into strings, and then get all columns:
df5 = (df.groupby('A')
.agg(lambda x: ','.join(x.astype(str)))
.reset_index())
print (df5)
A B C D
0 a one,two three,three 1,3
1 b three,two,three two,two,one 3,2,2
2 c two,one one,two 2,1
Question 4
How can I aggregate counts?
df = pd.DataFrame({'A' : ['a', 'c', 'b', 'b', 'a', 'c', 'b'],
'B' : ['one', 'two', 'three','two', 'two', 'one', 'three'],
'C' : ['three', np.nan, np.nan, 'two', 'three','two', 'one'],
'D' : [np.nan,2,3,2,3,np.nan,2]})
print (df)
A B C D
0 a one three NaN
1 c two NaN 2.0
2 b three NaN 3.0
3 b two two 2.0
4 a two three 3.0
5 c one two NaN
6 b three one 2.0
Function GroupBy.size for size of each group:
df1 = df.groupby('A').size().reset_index(name='COUNT')
print (df1)
A COUNT
0 a 2
1 b 3
2 c 2
Function GroupBy.count excludes missing values:
df2 = df.groupby('A')['C'].count().reset_index(name='COUNT')
print (df2)
A COUNT
0 a 2
1 b 2
2 c 1
This function should be used for multiple columns for counting non-missing values:
df3 = df.groupby('A').count().add_suffix('_COUNT').reset_index()
print (df3)
A B_COUNT C_COUNT D_COUNT
0 a 2 2 1
1 b 3 2 3
2 c 2 1 1
A related function is Series.value_counts. It returns the size of the object containing counts of unique values in descending order, so that the first element is the most frequently-occurring element. It excludes NaNs values by default.
df4 = (df['A'].value_counts()
.rename_axis('A')
.reset_index(name='COUNT'))
print (df4)
A COUNT
0 b 3
1 a 2
2 c 2
If you want same output like using function groupby + size, add Series.sort_index:
df5 = (df['A'].value_counts()
.sort_index()
.rename_axis('A')
.reset_index(name='COUNT'))
print (df5)
A COUNT
0 a 2
1 b 3
2 c 2
Question 5
How can I create a new column filled by aggregated values?
Method GroupBy.transform returns an object that is indexed the same (same size) as the one being grouped.
See the Pandas documentation for more information.
np.random.seed(123)
df = pd.DataFrame({'A' : ['foo', 'foo', 'bar', 'foo', 'bar', 'foo'],
'B' : ['one', 'two', 'three','two', 'two', 'one'],
'C' : np.random.randint(5, size=6),
'D' : np.random.randint(5, size=6)})
print (df)
A B C D
0 foo one 2 3
1 foo two 4 1
2 bar three 2 1
3 foo two 1 0
4 bar two 3 1
5 foo one 2 1
df['C1'] = df.groupby('A')['C'].transform('sum')
df['C2'] = df.groupby(['A','B'])['C'].transform('sum')
df[['C3','D3']] = df.groupby('A')['C','D'].transform('sum')
df[['C4','D4']] = df.groupby(['A','B'])['C','D'].transform('sum')
print (df)
A B C D C1 C2 C3 D3 C4 D4
0 foo one 2 3 9 4 9 5 4 4
1 foo two 4 1 9 5 9 5 5 1
2 bar three 2 1 5 2 5 2 2 1
3 foo two 1 0 9 5 9 5 5 1
4 bar two 3 1 5 3 5 2 3 1
5 foo one 2 1 9 4 9 5 4 4
If you are coming from an R or SQL background, here are three examples that will teach you everything you need to do aggregation the way you are already familiar with:
Let us first create a Pandas dataframe
import pandas as pd
df = pd.DataFrame({'key1' : ['a','a','a','b','a'],
'key2' : ['c','c','d','d','e'],
'value1' : [1,2,2,3,3],
'value2' : [9,8,7,6,5]})
df.head(5)
Here is how the table we created looks like:
key1
key2
value1
value2
a
c
1
9
a
c
2
8
a
d
2
7
b
d
3
6
a
e
3
5
1. Aggregating With Row Reduction Similar to SQL Group By
1.1 If Pandas version >=0.25
Check your Pandas version by running print(pd.__version__). If your Pandas version is 0.25 or above then the following code will work:
df_agg = df.groupby(['key1','key2']).agg(mean_of_value_1=('value1', 'mean'),
sum_of_value_2=('value2', 'sum'),
count_of_value1=('value1','size')
).reset_index()
df_agg.head(5)
The resulting data table will look like this:
key1
key2
mean_of_value1
sum_of_value2
count_of_value1
a
c
1.5
17
2
a
d
2.0
7
1
a
e
3.0
5
1
b
d
3.0
6
1
The SQL equivalent of this is:
SELECT
key1
,key2
,AVG(value1) AS mean_of_value_1
,SUM(value2) AS sum_of_value_2
,COUNT(*) AS count_of_value1
FROM
df
GROUP BY
key1
,key2
1.2 If Pandas version <0.25
If your Pandas version is older than 0.25 then running the above code will give you the following error:
TypeError: aggregate() missing 1 required positional argument: 'arg'
Now to do the aggregation for both value1 and value2, you will run this code:
df_agg = df.groupby(['key1','key2'],as_index=False).agg({'value1':['mean','count'],'value2':'sum'})
df_agg.columns = ['_'.join(col).strip() for col in df_agg.columns.values]
df_agg.head(5)
The resulting table will look like this:
key1
key2
value1_mean
value1_count
value2_sum
a
c
1.5
2
17
a
d
2.0
1
7
a
e
3.0
1
5
b
d
3.0
1
6
Renaming the columns needs to be done separately using the below code:
df_agg.rename(columns={"value1_mean" : "mean_of_value1",
"value1_count" : "count_of_value1",
"value2_sum" : "sum_of_value2"
}, inplace=True)
2. Create a Column Without Reduction in Rows (EXCEL - SUMIF, COUNTIF)
If you want to do a SUMIF, COUNTIF, etc., like how you would do in Excel where there is no reduction in rows, then you need to do this instead.
df['Total_of_value1_by_key1'] = df.groupby('key1')['value1'].transform('sum')
df.head(5)
The resulting data frame will look like this with the same number of rows as the original:
key1
key2
value1
value2
Total_of_value1_by_key1
a
c
1
9
8
a
c
2
8
8
a
d
2
7
8
b
d
3
6
3
a
e
3
5
8
3. Creating a RANK Column ROW_NUMBER() OVER (PARTITION BY ORDER BY)
Finally, there might be cases where you want to create a rank column which is the SQL equivalent of ROW_NUMBER() OVER (PARTITION BY key1 ORDER BY value1 DESC, value2 ASC).
Here is how you do that.
df['RN'] = df.sort_values(['value1','value2'], ascending=[False,True]) \
.groupby(['key1']) \
.cumcount() + 1
df.head(5)
Note: we make the code multi-line by adding \ at the end of each line.
Here is how the resulting data frame looks like:
key1
key2
value1
value2
RN
a
c
1
9
4
a
c
2
8
3
a
d
2
7
2
b
d
3
6
1
a
e
3
5
1
In all the examples above, the final data table will have a table structure and won't have the pivot structure that you might get in other syntaxes.
Other aggregating operators:
mean() Compute mean of groups
sum() Compute sum of group values
size() Compute group sizes
count() Compute count of group
std() Standard deviation of groups
var() Compute variance of groups
sem() Standard error of the mean of groups
describe() Generates descriptive statistics
first() Compute first of group values
last() Compute last of group values
nth() Take nth value, or a subset if n is a list
min() Compute min of group values
max() Compute max of group values
I have df that looks like this:
a b c d e f
1 na 2 3 4 5
1 na 2 3 4 5
1 na 2 3 4 5
1 6 2 3 4 5
How do I trim and reshape the dataframe so that for every column the n/a are dropped and the dataframe looks like this:
Edit;
df.dropna() is dropping all the rows.
a b c d e f
1 6 2 3 4 5
This dataframe has millions of rows, I need to be able to drop the n/a rows by column while retaining rows and columns with data in them.
edit;
df.dropna() is dropping all the rows in the column. When I check if the columns with n/a are empty, df.column_name.empty() I get false. So there is data in columns with n/a
For me dropna working nice for remove missing values and Nones:
df = df.dropna()
print (df)
a b c d e f
3 1 6.0 2 3 4 5
But if possible multiple values for removing create mask by isin, chain testing missing values with isnull and last filter by any - return at least one True per row by inverted mask ~:
df = pd.DataFrame({'a': ['a', None, 's', 'd'],
'b': ['na',7, 2, 6],
'c': [2, 2, 2, 2],
'd': [3, 3, 3, 3],
'e': [4, 4, np.nan, 4],
'f': [5, 5, 5, 5]})
print (df)
a b c d e f
0 a na 2 3 4.0 5
1 None 7 2 3 4.0 5
2 s 2 2 3 NaN 5
3 d 6 2 3 4.0 5
df1 = df.dropna()
print (df1)
a b c d e f
0 a na 2 3 4.0 5
3 d 6 2 3 4.0 5
mask = (df.isin(['na', 'n/a']) | df.isnull()).any(axis=1)
df2 = df[~mask]
print (df2)
a b c d e f
3 d 6 2 3 4.0 5