Pandas Conditional Groupby Count Part 2 - python-3.x

Given this problem:
Pandas conditional groupby count
I would like the result to be this instead:
A D Dcount
0 foo 2 2
1 foo 4 2
2 foo 4 2
3 foo 2 2
4 bar 5 NaN
5 bar 4 NaN
6 bar 3 NaN
7 bar 2 NaN
What I mean is: if two conditions are met (column A == 'foo' and column D == 2),
I'd like the count of such rows (2) to appear in the Dcount column for every row where column A == 'foo'.
Can this be modified to allow for the desired result?
import pandas as pd
df = pd.DataFrame(
{'A' : ['foo', 'foo', 'foo', 'foo',
'bar', 'bar', 'bar', 'bar'],
'D' : [2, 4, 4, 2, 5, 4, 3, 2]})
#First, I filter
df2=df.loc[(df['A']=='foo')&(df['D']==2)]
#Then, I use groupby and lambda x to count
df['Dcount']=df2.groupby(['D'])['D'].transform(lambda x: x.count())
df
Thanks in advance!

You can use where from numpy in a one-liner:
import numpy as np
df['Dcount'] = np.where(df['A']=='foo', ((df.A=='foo') & (df.D==2)).sum(), np.nan)
#In [34]: df
#Out[34]:
# A D Dcount
#0 foo 2 2
#1 foo 4 2
#2 foo 4 2
#3 foo 2 2
#4 bar 5 NaN
#5 bar 4 NaN
#6 bar 3 NaN
#7 bar 2 NaN
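The same result can also be sketched with a boolean mask and Series.where, which avoids the NumPy call. This is only an illustrative variant of the one-liner above; the names is_foo and match_count are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'foo',
                         'bar', 'bar', 'bar', 'bar'],
                   'D': [2, 4, 4, 2, 5, 4, 3, 2]})

# Rows that belong to the 'foo' group (illustrative name)
is_foo = df['A'] == 'foo'
# Number of rows where A == 'foo' and D == 2 (illustrative name)
match_count = (is_foo & (df['D'] == 2)).sum()

# Broadcast the count to every row, then blank out the non-'foo' rows with NaN
df['Dcount'] = pd.Series(match_count, index=df.index).where(is_foo)
print(df)
```

Because the non-matching rows hold NaN, the Dcount column ends up with a float dtype.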

Related

How to map/replace multiple values in a column for each row in pandas dataframe

I have this sample
col1 result
1 A
1,2,3
2 B
2,3,4
3,4
4 D
1,3,4
3 C
Here's my map variable.
vals_to_replace = {'1':'A', '2':'B', '3':'C' , '4':'D'}
I map this to col1, but only some values end up in the result column; I'm not sure why only the single-value cells get mapped.
Any ideas on how to solve it?
Thanks
Maybe this is what works for you:
import pandas as pd
df = pd.DataFrame({'col1': ['1', '1,2,3', '2', '2,3,4', '3, 4', '4', '1,3,4', '3']})
translation = {'1':'A', '2':'B', '3':'C' , '4':'D'}
df['result'] = df.col1.str.translate(str.maketrans(translation))
print(df)
Result:
col1 result
0 1 A
1 1,2,3 A,B,C
2 2 B
3 2,3,4 B,C,D
4 3, 4 C, D
5 4 D
6 1,3,4 A,C,D
7 3 C
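Note that str.translate replaces one character at a time, so it fits this data because every key is a single character. If the keys could be longer (for example '10'), a split-map-join approach is one possible alternative. The sketch below assumes comma-separated values; the helper map_tokens and the extra key '10' are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['1', '1,2,3', '3, 4', '10,2']})
# '10' is a hypothetical multi-character key added for illustration
mapping = {'1': 'A', '2': 'B', '3': 'C', '4': 'D', '10': 'J'}

def map_tokens(cell):
    # Split on commas, strip stray spaces, and map each token;
    # tokens missing from the mapping are kept unchanged
    tokens = [tok.strip() for tok in cell.split(',')]
    return ','.join(mapping.get(tok, tok) for tok in tokens)

df['result'] = df['col1'].map(map_tokens)
print(df)
```

Unlike the translate version, this normalizes '3, 4' to 'C,D' because the spaces are stripped before joining.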

Select row by max of a column Pandas Python [duplicate]

How can I perform aggregation with Pandas?
No DataFrame after aggregation! What happened?
How can I aggregate mainly strings columns (to lists, tuples, strings with separator)?
How can I aggregate counts?
How can I create a new column filled by aggregated values?
I've seen these recurring questions asking about various facets of the pandas aggregate functionality.
Most of the information regarding aggregation and its various use cases today is fragmented across dozens of badly worded, unsearchable posts.
The aim here is to collate some of the more important points for posterity.
This Q&A is meant to be the next instalment in a series of helpful user-guides:
How to pivot a dataframe,
Pandas concat
How do I operate on a DataFrame with a Series for every column?
Pandas Merging 101
Please note that this post is not meant to be a replacement for the documentation about aggregation and about groupby, so please read that as well!
Question 1
How can I perform aggregation with Pandas?
Expanded aggregation documentation.
Aggregating functions are the ones that reduce the dimension of the returned objects. This means the output Series/DataFrame has fewer rows than, or the same number of rows as, the original.
Some common aggregating functions are tabulated below:
Function Description
mean() Compute mean of groups
sum() Compute sum of group values
size() Compute group sizes
count() Compute count of group
std() Standard deviation of groups
var() Compute variance of groups
sem() Standard error of the mean of groups
describe() Generates descriptive statistics
first() Compute first of group values
last() Compute last of group values
nth() Take nth value, or a subset if n is a list
min() Compute min of group values
max() Compute max of group values
np.random.seed(123)
df = pd.DataFrame({'A' : ['foo', 'foo', 'bar', 'foo', 'bar', 'foo'],
'B' : ['one', 'two', 'three','two', 'two', 'one'],
'C' : np.random.randint(5, size=6),
'D' : np.random.randint(5, size=6),
'E' : np.random.randint(5, size=6)})
print (df)
A B C D E
0 foo one 2 3 0
1 foo two 4 1 0
2 bar three 2 1 1
3 foo two 1 0 3
4 bar two 3 1 4
5 foo one 2 1 0
Aggregation by filtered columns and Cython implemented functions:
df1 = df.groupby(['A', 'B'], as_index=False)['C'].sum()
print (df1)
A B C
0 bar three 2
1 bar two 3
2 foo one 4
3 foo two 5
When no columns are selected after groupby, the aggregate function is applied to every column that is not used for grouping (here, every column except A and B):
df2 = df.groupby(['A', 'B'], as_index=False).sum()
print (df2)
A B C D E
0 bar three 2 1 1
1 bar two 3 1 4
2 foo one 4 4 0
3 foo two 5 1 3
You can also specify only some columns used for aggregation in a list after the groupby function:
df3 = df.groupby(['A', 'B'], as_index=False)[['C','D']].sum()
print (df3)
A B C D
0 bar three 2 1
1 bar two 3 1
2 foo one 4 4
3 foo two 5 1
Same results by using function DataFrameGroupBy.agg:
df1 = df.groupby(['A', 'B'], as_index=False)['C'].agg('sum')
print (df1)
A B C
0 bar three 2
1 bar two 3
2 foo one 4
3 foo two 5
df2 = df.groupby(['A', 'B'], as_index=False).agg('sum')
print (df2)
A B C D E
0 bar three 2 1 1
1 bar two 3 1 4
2 foo one 4 4 0
3 foo two 5 1 3
To apply multiple functions to one column, use a list of tuples, each holding the new column name and the aggregation function:
df4 = (df.groupby(['A', 'B'])['C']
.agg([('average','mean'),('total','sum')])
.reset_index())
print (df4)
A B average total
0 bar three 2.0 2
1 bar two 3.0 3
2 foo one 2.0 4
3 foo two 2.5 5
To apply multiple functions to every column, pass the same list of tuples without selecting a column:
df5 = (df.groupby(['A', 'B'])
.agg([('average','mean'),('total','sum')]))
print (df5)
C D E
average total average total average total
A B
bar three 2.0 2 1.0 1 1.0 1
two 3.0 3 1.0 1 4.0 4
foo one 2.0 4 2.0 4 0.0 0
two 2.5 5 0.5 1 1.5 3
The result has a MultiIndex in its columns:
print (df5.columns)
MultiIndex(levels=[['C', 'D', 'E'], ['average', 'total']],
labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])
To flatten the MultiIndex into ordinary columns, use map with join:
df5.columns = df5.columns.map('_'.join)
df5 = df5.reset_index()
print (df5)
A B C_average C_total D_average D_total E_average E_total
0 bar three 2.0 2 1.0 1 1.0 1
1 bar two 3.0 3 1.0 1 4.0 4
2 foo one 2.0 4 2.0 4 0.0 0
3 foo two 2.5 5 0.5 1 1.5 3
Another solution is to pass a list of aggregate functions, then flatten the MultiIndex and use str.replace to rename the columns:
df5 = df.groupby(['A', 'B']).agg(['mean','sum'])
df5.columns = (df5.columns.map('_'.join)
.str.replace('sum','total')
.str.replace('mean','average'))
df5 = df5.reset_index()
print (df5)
A B C_average C_total D_average D_total E_average E_total
0 bar three 2.0 2 1.0 1 1.0 1
1 bar two 3.0 3 1.0 1 4.0 4
2 foo one 2.0 4 2.0 4 0.0 0
3 foo two 2.5 5 0.5 1 1.5 3
To specify the aggregation function for each column separately, pass a dictionary:
df6 = (df.groupby(['A', 'B'], as_index=False)
.agg({'C':'sum','D':'mean'})
.rename(columns={'C':'C_total', 'D':'D_average'}))
print (df6)
A B C_total D_average
0 bar three 2 1.0
1 bar two 3 1.0
2 foo one 4 2.0
3 foo two 5 0.5
You can pass a custom function too:
def func(x):
    # Sum of the first and last value in each group
    return x.iat[0] + x.iat[-1]
df7 = (df.groupby(['A', 'B'], as_index=False)
.agg({'C':'sum','D': func})
.rename(columns={'C':'C_total', 'D':'D_sum_first_and_last'}))
print (df7)
A B C_total D_sum_first_and_last
0 bar three 2 2
1 bar two 3 2
2 foo one 4 4
3 foo two 5 1
Question 2
No DataFrame after aggregation! What happened?
Aggregation by two or more columns:
df1 = df.groupby(['A', 'B'])['C'].sum()
print (df1)
A B
bar three 2
two 3
foo one 4
two 5
Name: C, dtype: int32
First check the Index and type of a Pandas object:
print (df1.index)
MultiIndex(levels=[['bar', 'foo'], ['one', 'three', 'two']],
labels=[[0, 0, 1, 1], [1, 2, 0, 2]],
names=['A', 'B'])
print (type(df1))
<class 'pandas.core.series.Series'>
There are two ways to turn a MultiIndex Series into a DataFrame with regular columns:
add parameter as_index=False
df1 = df.groupby(['A', 'B'], as_index=False)['C'].sum()
print (df1)
A B C
0 bar three 2
1 bar two 3
2 foo one 4
3 foo two 5
use Series.reset_index:
df1 = df.groupby(['A', 'B'])['C'].sum().reset_index()
print (df1)
A B C
0 bar three 2
1 bar two 3
2 foo one 4
3 foo two 5
If you group by one column:
df2 = df.groupby('A')['C'].sum()
print (df2)
A
bar 5
foo 9
Name: C, dtype: int32
... you get a Series with an Index:
print (df2.index)
Index(['bar', 'foo'], dtype='object', name='A')
print (type(df2))
<class 'pandas.core.series.Series'>
And the solution is the same as for the MultiIndex Series:
df2 = df.groupby('A', as_index=False)['C'].sum()
print (df2)
A C
0 bar 5
1 foo 9
df2 = df.groupby('A')['C'].sum().reset_index()
print (df2)
A C
0 bar 5
1 foo 9
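A related detail, shown here as a small sketch rather than as part of the original answer: selecting the column with a list (double brackets) makes the aggregation return a DataFrame instead of a Series, although the grouping keys still sit in the index until reset_index is called. The sample values below are typed in by hand to match the C column above.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'foo', 'bar', 'foo'],
                   'B': ['one', 'two', 'three', 'two', 'two', 'one'],
                   'C': [2, 4, 2, 1, 3, 2]})

# Single brackets select a Series, so the result is a Series
as_series = df.groupby(['A', 'B'])['C'].sum()
# A list of columns selects a DataFrame, so the result is a DataFrame
as_frame = df.groupby(['A', 'B'])[['C']].sum()

print(type(as_series).__name__)
print(type(as_frame).__name__)
# The grouping keys become regular columns only after reset_index
print(as_frame.reset_index())
```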
Question 3
How can I aggregate mainly strings columns (to lists, tuples, strings with separator)?
df = pd.DataFrame({'A' : ['a', 'c', 'b', 'b', 'a', 'c', 'b'],
'B' : ['one', 'two', 'three','two', 'two', 'one', 'three'],
'C' : ['three', 'one', 'two', 'two', 'three','two', 'one'],
'D' : [1,2,3,2,3,1,2]})
print (df)
A B C D
0 a one three 1
1 c two one 2
2 b three two 3
3 b two two 2
4 a two three 3
5 c one two 1
6 b three one 2
Instead of an aggregation function, it is possible to pass list, tuple, set for converting the column:
df1 = df.groupby('A')['B'].agg(list).reset_index()
print (df1)
A B
0 a [one, two]
1 b [three, two, three]
2 c [two, one]
An alternative is to use GroupBy.apply:
df1 = df.groupby('A')['B'].apply(list).reset_index()
print (df1)
A B
0 a [one, two]
1 b [three, two, three]
2 c [two, one]
For converting to strings with a separator, use .join only if it is a string column:
df2 = df.groupby('A')['B'].agg(','.join).reset_index()
print (df2)
A B
0 a one,two
1 b three,two,three
2 c two,one
If it is a numeric column, use a lambda function with astype for converting to strings:
df3 = (df.groupby('A')['D']
.agg(lambda x: ','.join(x.astype(str)))
.reset_index())
print (df3)
A D
0 a 1,3
1 b 3,2,2
2 c 2,1
Another solution is converting to strings before groupby:
df3 = (df.assign(D = df['D'].astype(str))
.groupby('A')['D']
.agg(','.join).reset_index())
print (df3)
A D
0 a 1,3
1 b 3,2,2
2 c 2,1
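To convert all columns, don't select any column(s) after groupby.
In older pandas versions there is no column D in the output because of the automatic exclusion of 'nuisance' columns: the numeric column cannot be joined as strings, so it is dropped. Pandas 2.0 and later raise a TypeError here instead of dropping the column.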
df4 = df.groupby('A').agg(','.join).reset_index()
print (df4)
A B C
0 a one,two three,three
1 b three,two,three two,two,one
2 c two,one one,two
So it's necessary to convert all columns into strings, and then get all columns:
df5 = (df.groupby('A')
.agg(lambda x: ','.join(x.astype(str)))
.reset_index())
print (df5)
A B C D
0 a one,two three,three 1,3
1 b three,two,three two,two,one 3,2,2
2 c two,one one,two 2,1
Question 4
How can I aggregate counts?
df = pd.DataFrame({'A' : ['a', 'c', 'b', 'b', 'a', 'c', 'b'],
'B' : ['one', 'two', 'three','two', 'two', 'one', 'three'],
'C' : ['three', np.nan, np.nan, 'two', 'three','two', 'one'],
'D' : [np.nan,2,3,2,3,np.nan,2]})
print (df)
A B C D
0 a one three NaN
1 c two NaN 2.0
2 b three NaN 3.0
3 b two two 2.0
4 a two three 3.0
5 c one two NaN
6 b three one 2.0
Function GroupBy.size for size of each group:
df1 = df.groupby('A').size().reset_index(name='COUNT')
print (df1)
A COUNT
0 a 2
1 b 3
2 c 2
Function GroupBy.count excludes missing values:
df2 = df.groupby('A')['C'].count().reset_index(name='COUNT')
print (df2)
A COUNT
0 a 2
1 b 2
2 c 1
The same function can be applied to multiple columns to count the non-missing values in each:
df3 = df.groupby('A').count().add_suffix('_COUNT').reset_index()
print (df3)
A B_COUNT C_COUNT D_COUNT
0 a 2 2 1
1 b 3 2 3
2 c 2 1 1
A related function is Series.value_counts. It returns a Series containing counts of unique values in descending order, so that the first element is the most frequently occurring element. It excludes NaN values by default.
df4 = (df['A'].value_counts()
.rename_axis('A')
.reset_index(name='COUNT'))
print (df4)
A COUNT
0 b 3
1 a 2
2 c 2
If you want the same output as groupby + size, add Series.sort_index:
df5 = (df['A'].value_counts()
.sort_index()
.rename_axis('A')
.reset_index(name='COUNT'))
print (df5)
A COUNT
0 a 2
1 b 3
2 c 2
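To see size and count side by side, the two can be combined in one call. This sketch assumes pandas 0.25 or later (named aggregation); the output column names rows and non_missing_C are arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['a', 'c', 'b', 'b', 'a', 'c', 'b'],
                   'C': ['three', np.nan, np.nan, 'two', 'three', 'two', 'one']})

counts = (df.groupby('A')
            .agg(rows=('C', 'size'),            # all rows in the group
                 non_missing_C=('C', 'count'))  # rows where C is not NaN
            .reset_index())
print(counts)
```

The two columns differ exactly for the groups that contain missing values in C (here b and c).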
Question 5
How can I create a new column filled by aggregated values?
Method GroupBy.transform returns an object that is indexed the same (same size) as the one being grouped.
See the Pandas documentation for more information.
np.random.seed(123)
df = pd.DataFrame({'A' : ['foo', 'foo', 'bar', 'foo', 'bar', 'foo'],
'B' : ['one', 'two', 'three','two', 'two', 'one'],
'C' : np.random.randint(5, size=6),
'D' : np.random.randint(5, size=6)})
print (df)
A B C D
0 foo one 2 3
1 foo two 4 1
2 bar three 2 1
3 foo two 1 0
4 bar two 3 1
5 foo one 2 1
df['C1'] = df.groupby('A')['C'].transform('sum')
df['C2'] = df.groupby(['A','B'])['C'].transform('sum')
df[['C3','D3']] = df.groupby('A')[['C','D']].transform('sum')
df[['C4','D4']] = df.groupby(['A','B'])[['C','D']].transform('sum')
print (df)
A B C D C1 C2 C3 D3 C4 D4
0 foo one 2 3 9 4 9 5 4 4
1 foo two 4 1 9 5 9 5 5 1
2 bar three 2 1 5 2 5 2 2 1
3 foo two 1 0 9 5 9 5 5 1
4 bar two 3 1 5 3 5 2 3 1
5 foo one 2 1 9 4 9 5 4 4
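Because the transformed result lines up with the original rows, it can be used directly in row-wise arithmetic. As a small sketch (the column name C_share is made up, and the C values are typed in by hand to match the frame above), this computes each row's share of its group's total:

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'foo', 'bar', 'foo'],
                   'C': [2, 4, 2, 1, 3, 2]})

# Group totals broadcast back to every row of the group
group_total = df.groupby('A')['C'].transform('sum')
# Fraction of the group total contributed by each row
df['C_share'] = df['C'] / group_total
print(df)
```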
If you are coming from an R or SQL background, here are three examples that will teach you everything you need to do aggregation the way you are already familiar with:
Let us first create a Pandas dataframe
import pandas as pd
df = pd.DataFrame({'key1' : ['a','a','a','b','a'],
'key2' : ['c','c','d','d','e'],
'value1' : [1,2,2,3,3],
'value2' : [9,8,7,6,5]})
df.head(5)
Here is what the table we created looks like:
key1 key2 value1 value2
a c 1 9
a c 2 8
a d 2 7
b d 3 6
a e 3 5
1. Aggregating With Row Reduction Similar to SQL Group By
1.1 If Pandas version >=0.25
Check your Pandas version by running print(pd.__version__). If your Pandas version is 0.25 or above then the following code will work:
df_agg = df.groupby(['key1','key2']).agg(mean_of_value_1=('value1', 'mean'),
sum_of_value_2=('value2', 'sum'),
count_of_value1=('value1','size')
).reset_index()
df_agg.head(5)
The resulting data table will look like this:
key1 key2 mean_of_value_1 sum_of_value_2 count_of_value1
a c 1.5 17 2
a d 2.0 7 1
a e 3.0 5 1
b d 3.0 6 1
The SQL equivalent of this is:
SELECT
key1
,key2
,AVG(value1) AS mean_of_value_1
,SUM(value2) AS sum_of_value_2
,COUNT(*) AS count_of_value1
FROM
df
GROUP BY
key1
,key2
1.2 If Pandas version <0.25
If your Pandas version is older than 0.25 then running the above code will give you the following error:
TypeError: aggregate() missing 1 required positional argument: 'arg'
Now to do the aggregation for both value1 and value2, you will run this code:
df_agg = df.groupby(['key1','key2'],as_index=False).agg({'value1':['mean','count'],'value2':'sum'})
df_agg.columns = ['_'.join(col).strip('_') for col in df_agg.columns.values]
df_agg.head(5)
The resulting table will look like this:
key1 key2 value1_mean value1_count value2_sum
a c 1.5 2 17
a d 2.0 1 7
a e 3.0 1 5
b d 3.0 1 6
Renaming the columns needs to be done separately using the below code:
df_agg.rename(columns={"value1_mean" : "mean_of_value1",
"value1_count" : "count_of_value1",
"value2_sum" : "sum_of_value2"
}, inplace=True)
2. Create a Column Without Reduction in Rows (EXCEL - SUMIF, COUNTIF)
If you want to do a SUMIF, COUNTIF, etc., like how you would do in Excel where there is no reduction in rows, then you need to do this instead.
df['Total_of_value1_by_key1'] = df.groupby('key1')['value1'].transform('sum')
df.head(5)
The resulting data frame will look like this with the same number of rows as the original:
key1 key2 value1 value2 Total_of_value1_by_key1
a c 1 9 8
a c 2 8 8
a d 2 7 8
b d 3 6 3
a e 3 5 8
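SUMIF in Excel usually carries a condition. One way to sketch a conditional version is to zero out the rows that fail the condition before the transform. The condition (key2 == 'c') and the column name below are chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'a', 'b', 'a'],
                   'key2': ['c', 'c', 'd', 'd', 'e'],
                   'value1': [1, 2, 2, 3, 3]})

# Keep value1 where key2 == 'c', use 0 elsewhere
masked = df['value1'].where(df['key2'] == 'c', 0)
# Sum the masked values within each key1 group, keeping all rows
df['value1_where_c_by_key1'] = masked.groupby(df['key1']).transform('sum')
print(df)
```

Every row still receives its group's conditional total, including rows that do not satisfy the condition themselves.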
3. Creating a RANK Column ROW_NUMBER() OVER (PARTITION BY ORDER BY)
Finally, there might be cases where you want to create a rank column which is the SQL equivalent of ROW_NUMBER() OVER (PARTITION BY key1 ORDER BY value1 DESC, value2 ASC).
Here is how you do that.
df['RN'] = df.sort_values(['value1','value2'], ascending=[False,True]) \
.groupby(['key1']) \
.cumcount() + 1
df.head(5)
Note: we make the code multi-line by adding \ at the end of each line.
Here is what the resulting data frame looks like:
key1 key2 value1 value2 RN
a c 1 9 4
a c 2 8 3
a d 2 7 2
b d 3 6 1
a e 3 5 1
In all the examples above, the final data table will have a table structure and won't have the pivot structure that you might get in other syntaxes.
Other aggregating operators:
mean() Compute mean of groups
sum() Compute sum of group values
size() Compute group sizes
count() Compute count of group
std() Standard deviation of groups
var() Compute variance of groups
sem() Standard error of the mean of groups
describe() Generates descriptive statistics
first() Compute first of group values
last() Compute last of group values
nth() Take nth value, or a subset if n is a list
min() Compute min of group values
max() Compute max of group values

Pandas aggregate column and keep header

I have code that works but returns the data without the header. Is there a way to write this code so the header is not removed? I know one way would be to add the header back, but is there a better way?
My code:
df = pd.read_csv("_data.csv",skiprows=[0], header=None)
df = df.groupby([2])[10].sum().astype(float)
Data:
A B
1 2
1 1
2 3
2 4
I have data like above trying to get this result:
A B
1 3
2 7
Try to use the function reset_index after the sum:
data = [{'a': 1, 'b': 2},{'a': 1, 'b': 1},{'a': 2, 'b': 3},{'a': 2, 'b': 4}]
df = pd.DataFrame(data)
df
a b
0 1 2
1 1 1
2 2 3
3 2 4
df.groupby('a').sum().reset_index()
a b
0 1 3
1 2 7
You should specify the separator (several spaces in your case) and that the header is the first row (header=0, with Python's zero-based indexing), then group by the column you want.
df = pd.read_csv("_data.csv", sep=r'\s+', header=0)
A B
0 1 2
1 1 1
2 2 3
3 2 4
df = df.groupby(['A']).sum()
B
A
1 3
2 7
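Putting the pieces together, a self-contained sketch might look like this. It assumes a comma-separated file with a header row; io.StringIO stands in for the real _data.csv so the example can run as-is.

```python
import io
import pandas as pd

# Stand-in for the real file; the comma separator is an assumption
csv_text = "A,B\n1,2\n1,1\n2,3\n2,4\n"
df = pd.read_csv(io.StringIO(csv_text))  # header=0 is the default

# as_index=False keeps A as a regular column with its header
result = df.groupby('A', as_index=False)['B'].sum()
print(result)
```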

How to trim and reshape dataframe?

I have df that looks like this:
a b c d e f
1 na 2 3 4 5
1 na 2 3 4 5
1 na 2 3 4 5
1 6 2 3 4 5
How do I trim and reshape the dataframe so that for every column the n/a are dropped and the dataframe looks like this:
Edit;
df.dropna() is dropping all the rows.
a b c d e f
1 6 2 3 4 5
This dataframe has millions of rows, I need to be able to drop the n/a rows by column while retaining rows and columns with data in them.
edit;
df.dropna() is dropping all the rows in the column. When I check if the columns with n/a are empty, df.column_name.empty() I get false. So there is data in columns with n/a
For me, dropna works well for removing missing values and Nones:
df = df.dropna()
print (df)
a b c d e f
3 1 6.0 2 3 4 5
But if there are multiple placeholder values to remove, create a mask with isin, combine it with a missing-value test using isnull, reduce it with any (True if a row has at least one match), and finally filter with the inverted mask ~:
df = pd.DataFrame({'a': ['a', None, 's', 'd'],
'b': ['na',7, 2, 6],
'c': [2, 2, 2, 2],
'd': [3, 3, 3, 3],
'e': [4, 4, np.nan, 4],
'f': [5, 5, 5, 5]})
print (df)
a b c d e f
0 a na 2 3 4.0 5
1 None 7 2 3 4.0 5
2 s 2 2 3 NaN 5
3 d 6 2 3 4.0 5
df1 = df.dropna()
print (df1)
a b c d e f
0 a na 2 3 4.0 5
3 d 6 2 3 4.0 5
mask = (df.isin(['na', 'n/a']) | df.isnull()).any(axis=1)
df2 = df[~mask]
print (df2)
a b c d e f
3 d 6 2 3 4.0 5
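An equivalent sketch turns the placeholder strings into real missing values first and then lets dropna do the filtering. The placeholder list is an assumption about how missing data is spelled in the file, and the small frame is made up for the example.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 1],
                   'b': ['na', 'na', 'na', 6],
                   'c': [2, 2, 2, 2]})

# Assumed spellings of "missing" in the raw data
placeholders = ['na', 'n/a']

# mask() replaces matching cells with NaN; dropna() then removes those rows
cleaned = df.mask(df.isin(placeholders)).dropna()
print(cleaned)
```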

Pandas distinct count column

Inspired by this post, I would like to get a distinct count of a value in a data frame per a grouping and create a column with the distinct count values in the data frame.
Like this:
Original data frame:
import pandas as pd
df = pd.DataFrame(
{'A' : ['foo', 'foo', 'foo', 'foo',
'bar', 'bar', 'bar', 'bar'],
'B' : ['foo', 'fo', 'foo', 'foo',
'bar', 'bar', 'ba', 'ba'],
'C' : [2, 4, 4, 2, 5, 4, 3, 2]})
df
A B C
0 foo foo 2
1 foo fo 4
2 foo foo 4
3 foo foo 2
4 bar bar 5
5 bar bar 4
6 bar ba 3
7 bar ba 2
Method from linked post applied:
df=df.groupby(['A','B'])['C'].apply(lambda x: len(x.unique()))
df
Result per linked post method:
A B
bar ba 2
bar 2
foo fo 1
foo 2
Name: C, dtype: int64
Desired result:
A B C Distinct Count of C per A and B
0 foo foo 2 2
1 foo fo 4 1
2 foo foo 4 2
3 foo foo 2 2
4 bar bar 5 2
5 bar bar 4 2
6 bar ba 3 2
7 bar ba 2 2
Looking at the first row, the combination of "foo" in "A" and "foo" in "B" has 2 unique values associated with it (2 and 4), resulting in a 2 in each row for that combination of values for columns A and B.
Thanks in advance!
Use transform instead of apply because it returns a column with the same size as the original. I couldn't find documentation for it on the original pandas site, but from help:
transform(func, *args, **kwargs) method of
pandas.core.groupby.SeriesGroupBy instance
Call function producing a like-indexed Series on each group and return
a Series with the transformed values
df['Distinct Count of C per A and B'] = df.groupby(['A','B'])['C'].transform(lambda x: len(x.unique()))
In [1495]: df
Out[1495]:
A B C Distinct Count of C per A and B
0 foo foo 2 2
1 foo fo 4 1
2 foo foo 4 2
3 foo foo 2 2
4 bar bar 5 2
5 bar bar 4 2
6 bar ba 3 2
7 bar ba 2 2
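pandas also ships a built-in nunique aggregation, so the lambda can be replaced by its name. A minimal sketch on the same data (note that nunique ignores NaN by default, whereas len(x.unique()) would count it as a value):

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'foo',
                         'bar', 'bar', 'bar', 'bar'],
                   'B': ['foo', 'fo', 'foo', 'foo',
                         'bar', 'bar', 'ba', 'ba'],
                   'C': [2, 4, 4, 2, 5, 4, 3, 2]})

# Number of distinct C values per (A, B) pair, broadcast to every row
df['Distinct Count of C per A and B'] = (
    df.groupby(['A', 'B'])['C'].transform('nunique'))
print(df)
```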
