Pandas grouping with count [duplicate]

I have the following dataframe:
df = pd.DataFrame([
(1, 1, 'term1'),
(1, 2, 'term2'),
(1, 1, 'term1'),
(1, 1, 'term2'),
(2, 2, 'term3'),
(2, 3, 'term1'),
(2, 2, 'term1')
], columns=['id', 'group', 'term'])
I want to group it by id and group and calculate the number of each term for this id, group pair.
So in the end I am going to get something like this:
I was able to achieve what I want by looping over all the rows with df.iterrows() and creating a new dataframe, but this is clearly inefficient. (If it helps, I know the list of all terms beforehand and there are ~10 of them).
It looks like I have to group by and then count values, so I tried that with df.groupby(['id', 'group']).value_counts() which does not work because value_counts operates on the groupby series and not a dataframe.
Is there any way I can achieve this without looping?

I use groupby and size
df.groupby(['id', 'group', 'term']).size().unstack(fill_value=0)
1,000,000 rows
df = pd.DataFrame(dict(id=np.random.choice(100, 1000000),
group=np.random.choice(20, 1000000),
term=np.random.choice(10, 1000000)))
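As a small, self-contained sketch of the groupby and size approach, the snippet below applies it to the seven-row frame from the question. The variable name `counts` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame([
    (1, 1, 'term1'),
    (1, 2, 'term2'),
    (1, 1, 'term1'),
    (1, 1, 'term2'),
    (2, 2, 'term3'),
    (2, 3, 'term1'),
    (2, 2, 'term1'),
], columns=['id', 'group', 'term'])

# Count rows per (id, group, term), then move the term level into columns.
# fill_value=0 fills combinations that never occur with 0 instead of NaN.
counts = df.groupby(['id', 'group', 'term']).size().unstack(fill_value=0)
print(counts)
```

The result is indexed by the (id, group) pairs and has one integer column per term.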

using pivot_table() method:
In [22]: df.pivot_table(index=['id','group'], columns='term', aggfunc='size', fill_value=0)
term      term1  term2  term3
id group
1  1          2      1      0
   2          0      1      0
2  2          1      0      1
   3          1      0      0
Timing against 700K rows DF:
In [24]: df = pd.concat([df] * 10**5, ignore_index=True)
In [25]: df.shape
Out[25]: (700000, 3)
In [3]: %timeit df.groupby(['id', 'group', 'term'])['term'].size().unstack(fill_value=0)
1 loop, best of 3: 226 ms per loop
In [4]: %timeit df.pivot_table(index=['id','group'], columns='term', aggfunc='size', fill_value=0)
1 loop, best of 3: 236 ms per loop
In [5]: %timeit pd.crosstab([df.id, df.group], df.term)
1 loop, best of 3: 355 ms per loop
In [6]: %timeit df.groupby(['id','group','term'])['term'].size().unstack().fillna(0).astype(int)
1 loop, best of 3: 232 ms per loop
In [7]: %timeit df.groupby(['id', 'group', 'term']).size().unstack(fill_value=0)
1 loop, best of 3: 231 ms per loop
Timing against 7M rows DF:
In [9]: df = pd.concat([df] * 10, ignore_index=True)
In [10]: df.shape
Out[10]: (7000000, 3)
In [11]: %timeit df.groupby(['id', 'group', 'term'])['term'].size().unstack(fill_value=0)
1 loop, best of 3: 2.27 s per loop
In [12]: %timeit df.pivot_table(index=['id','group'], columns='term', aggfunc='size', fill_value=0)
1 loop, best of 3: 2.3 s per loop
In [13]: %timeit pd.crosstab([df.id, df.group], df.term)
1 loop, best of 3: 3.37 s per loop
In [14]: %timeit df.groupby(['id','group','term'])['term'].size().unstack().fillna(0).astype(int)
1 loop, best of 3: 2.28 s per loop
In [15]: %timeit df.groupby(['id', 'group', 'term']).size().unstack(fill_value=0)
1 loop, best of 3: 1.89 s per loop

Instead of remembering lengthy solutions, how about the one that pandas has built in for you:
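df.groupby(['id', 'group', 'term']).size()
(Note that .count() would return an empty frame here, because all three columns are used as grouping keys and no column is left to count.)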

You can use crosstab:
print (pd.crosstab([df.id, df.group], df.term))
term      term1  term2  term3
id group
1  1          2      1      0
   2          0      1      0
2  2          1      0      1
   3          1      0      0
Another solution with groupby with aggregating size, reshaping by unstack:
df.groupby(['id', 'group', 'term'])['term'].size().unstack(fill_value=0)
term      term1  term2  term3
id group
1  1          2      1      0
   2          0      1      0
2  2          1      0      1
   3          1      0      0
df = pd.concat([df]*10000).reset_index(drop=True)
In [48]: %timeit (df.groupby(['id', 'group', 'term']).size().unstack(fill_value=0))
100 loops, best of 3: 12.4 ms per loop
In [49]: %timeit (df.groupby(['id', 'group', 'term'])['term'].size().unstack(fill_value=0))
100 loops, best of 3: 12.2 ms per loop

If you want to use value_counts you can use it on a given series, and resort to the following:
df.groupby(["id", "group"])["term"].value_counts().unstack(fill_value=0)
or in an equivalent fashion, using the .agg method:
df.groupby(["id", "group"]).agg({"term": "value_counts"}).unstack(fill_value=0)
Another option is to directly use value_counts on the DataFrame itself without resorting to groupby:
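The code for this option did not survive in the text above, so the following is only a sketch of what it might look like. It assumes pandas 1.1 or later, where `DataFrame.value_counts` is available, and reuses the frame from the question. The trailing `sort_index()` is added here for a predictable row order.

```python
import pandas as pd

df = pd.DataFrame([
    (1, 1, 'term1'),
    (1, 2, 'term2'),
    (1, 1, 'term1'),
    (1, 1, 'term2'),
    (2, 2, 'term3'),
    (2, 3, 'term1'),
    (2, 2, 'term1'),
], columns=['id', 'group', 'term'])

# DataFrame.value_counts counts unique rows across all columns and returns
# a Series indexed by (id, group, term); unstack pivots the last level.
counts = df.value_counts().unstack(fill_value=0).sort_index()
print(counts)
```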

Another alternative:
df.assign(count=1).groupby(['id', 'group','term']).sum().unstack(fill_value=0).xs("count", 1)
term      term1  term2  term3
id group
1  1          2      1      0
   2          0      1      0
2  2          1      0      1
   3          1      0      0


Assign values to pandas column based on condition [duplicate]

I need to forward fill values in a column of a dataframe within groups. I should note that the first value in a group is never missing by construction. I have the following solutions at the moment.
df = pd.DataFrame({'a': [1,1,2,2,2], 'b': [1, np.nan, 2, np.nan, np.nan]})
# desired output
a b
1 1
1 1
2 2
2 2
2 2
Here are the three solutions that I've tried so far.
# really slow solutions
df['b'] = df.groupby('a')['b'].transform(lambda x: x.fillna(method='ffill'))
df['b'] = df.groupby('a')['b'].fillna(method='ffill')
# much faster solution, but more memory intensive and ugly all around
tmp = df.drop_duplicates('a', keep='first')
df.drop('b', inplace=True, axis=1)
df = df.merge(tmp, on='a')
All three of these produce my desired output, but the first two take a really long time on my data set, and the third solution is more memory intensive and feels rather clunky. Are there any other ways to forward fill a column?
You need to sort by both columns df.sort_values(['a', 'b']).ffill() to ensure robustness. If an np.nan is left in the first position within a group, ffill will fill that with a value from the prior group. Because np.nan will be placed at the end of any sort, sorting by both a and b ensures that you will not have np.nan at the front of any group. You can then .loc or .reindex with the initial index to get back your original order.
This will obviously be a tad slower than the other proposals... However, I contend it will be correct where the others are not.
Consider the dataframe df
df = pd.DataFrame({'a': [1,1,2,2,2], 'b': [1, np.nan, np.nan, 2, np.nan]})
a b
0 1 1.0
1 1 NaN
2 2 NaN
3 2 2.0
4 2 NaN
A plain forward fill, for example df.ffill(), carries the value from group 1 into group 2:
a b
0 1 1.0
1 1 1.0
2 2 1.0 # <--- this is incorrect
3 2 2.0
4 2 2.0
Instead do
df.sort_values(['a', 'b']).ffill().loc[df.index]
a b
0 1 1.0
1 1 1.0
2 2 2.0
3 2 2.0
4 2 2.0
special note
This is still incorrect if an entire group has missing values
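For comparison, a grouped forward fill never crosses group boundaries, so an entirely missing group stays missing. The sketch below uses `GroupBy.ffill`; the extra group `3` with only missing values is an assumption added for illustration. Note that, unlike the sort-based approach, a leading NaN inside a group is left as NaN rather than filled from a later row.

```python
import numpy as np
import pandas as pd

# Group 2 starts with a missing value; group 3 (hypothetical) is all missing.
df = pd.DataFrame({
    'a': [1, 1, 2, 2, 3, 3],
    'b': [1, np.nan, np.nan, 2, np.nan, np.nan],
})

# Forward fill within each group of 'a'; the result keeps the original index.
filled = df.groupby('a')['b'].ffill()
print(filled)
```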
Using ffill() directly will give the best results. Here is the comparison
%timeit df.b.ffill(inplace = True)
best of 3: 311 µs per loop
%timeit df['b'] = df.groupby('a')['b'].transform(lambda x: x.fillna(method='ffill'))
best of 3: 2.34 ms per loop
%timeit df['b'] = df.groupby('a')['b'].fillna(method='ffill')
best of 3: 4.41 ms per loop

Create a new column in Dataframe based on condition being check. I want to check value of a column for multiple values and then assign it a value [duplicate]

How can I achieve the equivalents of SQL's IN and NOT IN?
I have a list with the required values.
Here's the scenario:
df = pd.DataFrame({'country': ['US', 'UK', 'Germany', 'China']})
countries_to_keep = ['UK', 'China']
# pseudo-code:
df[df['country'] not in countries_to_keep]
My current way of doing this is as follows:
df = pd.DataFrame({'country': ['US', 'UK', 'Germany', 'China']})
df2 = pd.DataFrame({'country': ['UK', 'China'], 'matched': True})
# IN
df.merge(df2, how='inner', on='country')
# NOT IN
not_in = df.merge(df2, how='left', on='country')
not_in = not_in[pd.isnull(not_in['matched'])]
But this seems like a horrible kludge. Can anyone improve on it?
You can use pd.Series.isin.
For "IN" use: something.isin(somewhere)
Or for "NOT IN": ~something.isin(somewhere)
As a worked example:
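>>> df
   country
0       US
1       UK
2  Germany
3    China
>>> countries_to_keep
['UK', 'China']
>>> df.country.isin(countries_to_keep)
0    False
1     True
2    False
3     True
Name: country, dtype: bool
>>> df[df.country.isin(countries_to_keep)]
  country
1      UK
3   China
>>> df[~df.country.isin(countries_to_keep)]
   country
0       US
2  Germany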
Alternative solution that uses .query() method:
In [5]: df.query("countries in @countries_to_keep")
  countries
1        UK
3     China
In [6]: df.query("countries not in @countries_to_keep")
  countries
0        US
2   Germany
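A runnable sketch of the `.query()` approach, assuming the frame from the question (column `country`) rather than the `countries` column used in the session above. Inside the query string, `@name` refers to a Python variable in the calling scope.

```python
import pandas as pd

df = pd.DataFrame({'country': ['US', 'UK', 'Germany', 'China']})
countries_to_keep = ['UK', 'China']

# IN: keep rows whose country appears in the list.
kept = df.query("country in @countries_to_keep")

# NOT IN: keep rows whose country does not appear in the list.
dropped = df.query("country not in @countries_to_keep")

print(kept)
print(dropped)
```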
How to implement 'in' and 'not in' for a pandas DataFrame?
Pandas offers two methods: Series.isin and DataFrame.isin for Series and DataFrames, respectively.
Filter DataFrame Based on ONE Column (also applies to Series)
The most common scenario is applying an isin condition on a specific column to filter rows in a DataFrame.
df = pd.DataFrame({'countries': ['US', 'UK', 'Germany', np.nan, 'China']})
  countries
0        US
1        UK
2   Germany
3       NaN
4     China
c1 = ['UK', 'China'] # list
c2 = {'Germany'} # set
c3 = pd.Series(['China', 'US']) # Series
c4 = np.array(['US', 'UK']) # array
Series.isin accepts various types as inputs. The following are all valid ways of getting what you want:
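df['countries'].isin(c1)
0    False
1     True
2    False
3    False
4     True
Name: countries, dtype: bool
# `in` operation
df[df['countries'].isin(c1)]
  countries
1        UK
4     China
# `not in` operation
df[~df['countries'].isin(c1)]
  countries
0        US
2   Germany
3       NaN
# Filter with `set` (tuples work too)
df[df['countries'].isin(c2)]
  countries
2   Germany
# Filter with another Series
df[df['countries'].isin(c3)]
  countries
0        US
4     China
# Filter with array
df[df['countries'].isin(c4)]
  countries
0        US
1        UK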
Filter on MANY Columns
Sometimes, you will want to apply an 'in' membership check with some search terms over multiple columns,
df2 = pd.DataFrame({
'A': ['x', 'y', 'z', 'q'], 'B': ['w', 'a', np.nan, 'x'], 'C': np.arange(4)})
   A    B  C
0  x    w  0
1  y    a  1
2  z  NaN  2
3  q    x  3
c1 = ['x', 'w', 'p']
To apply the isin condition to both columns "A" and "B", use DataFrame.isin:
df2[['A', 'B']].isin(c1)
       A      B
0   True   True
1  False  False
2  False  False
3  False   True
From this, to retain rows where at least one column is True, we can use any along axis 1 (that is, across the columns of each row):
df2[['A', 'B']].isin(c1).any(axis=1)
0     True
1    False
2    False
3     True
dtype: bool
df2[df2[['A', 'B']].isin(c1).any(axis=1)]
   A  B  C
0  x  w  0
3  q  x  3
Note that if you want to search every column, you'd just omit the column selection step and do
df2.isin(c1).any(axis=1)
Similarly, to retain rows where ALL columns are True, use all in the same manner as before.
df2[df2[['A', 'B']].isin(c1).all(axis=1)]
   A  B  C
0  x  w  0
Notable Mentions: numpy.isin, query, list comprehensions (string data)
In addition to the methods described above, you can also use the numpy equivalent: numpy.isin.
# `in` operation
df[np.isin(df['countries'], c1)]
1 UK
4 China
# `not in` operation
df[np.isin(df['countries'], c1, invert=True)]
0 US
2 Germany
3 NaN
Why is it worth considering? NumPy functions are usually a bit faster than their pandas equivalents because of lower overhead. Since this is an elementwise operation that does not depend on index alignment, there are very few situations where this method is not an appropriate replacement for pandas' isin.
Pandas routines are usually iterative when working with strings, because string operations are hard to vectorise. There is a lot of evidence to suggest that list comprehensions will be faster here.
We resort to an in check now.
c1_set = set(c1) # Using `in` with `sets` is a constant time operation...
# This doesn't matter for pandas because the implementation differs.
# `in` operation
df[[x in c1_set for x in df['countries']]]
1 UK
4 China
# `not in` operation
df[[x not in c1_set for x in df['countries']]]
0 US
2 Germany
3 NaN
It is a lot more unwieldy to specify, however, so don't use it unless you know what you're doing.
Lastly, there's also DataFrame.query which has been covered in this answer. numexpr FTW!
I've been usually doing generic filtering over rows like this:
criterion = lambda row: row['countries'] not in countries
not_in = df[df.apply(criterion, axis=1)]
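Collating possible solutions from the answers:
For IN:
df[df['A'].isin([3, 6])]
For NOT IN:
df[~df["A"].isin([3, 6])]
df[df["A"].isin([3, 6]) == False]
df[np.logical_not(df["A"].isin([3, 6]))]
Older answers also list df[-df["A"].isin([3, 6])], but unary minus on a boolean Series raises a TypeError with current NumPy, so prefer ~.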
I wanted to filter out dfbc rows that had a BUSINESS_ID that was also in the BUSINESS_ID of dfProfilesBusIds
dfbc = dfbc[~dfbc['BUSINESS_ID'].isin(dfProfilesBusIds['BUSINESS_ID'])]
Why is no one talking about the performance of various filtering methods? In fact, this topic often pops up here (see the example). I did my own performance test for a large data set. It is very interesting and instructive.
df = pd.DataFrame({'animals': np.random.choice(['cat', 'dog', 'mouse', 'birds'], size=10**7),
'number': np.random.randint(0,100, size=(10**7,))})
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 2 columns):
# Column Dtype
--- ------ -----
0 animals object
1 number int64
dtypes: int64(1), object(1)
memory usage: 152.6+ MB
# .isin() by one column
conditions = ['cat', 'dog']
367 ms ± 2.34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# .query() by one column
conditions = ['cat', 'dog']
df.query('animals in @conditions')
395 ms ± 3.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# .loc[]
987 ms ± 5.17 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
df[df.apply(lambda x: x['animals'] in ['cat', 'dog'], axis=1)]
41.9 s ± 490 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
new_df = df.set_index('animals')
new_df.loc[['cat', 'dog'], :]
3.64 s ± 62.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
new_df = df.set_index('animals')
new_df[new_df.index.isin(['cat', 'dog'])]
469 ms ± 8.98 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
s = pd.Series(['cat', 'dog'], name='animals')
df.merge(s, on='animals', how='inner')
796 ms ± 30.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thus, the isin method turned out to be the fastest and the method with apply() was the slowest, which is not surprising.
You can also use .isin() inside .query():
df.query('country.isin(@countries_to_keep).values')
# Or alternatively:
df.query('country.isin(["UK", "China"]).values')
To negate your query, use ~:
df.query('~country.isin(@countries_to_keep).values')
Another way is to use comparison operators:
df.query('country == @countries_to_keep')
# Or alternatively:
df.query('country == ["UK", "China"]')
And to negate the query, use !=:
df.query('country != @countries_to_keep')
df = pd.DataFrame({'countries':['US','UK','Germany','China']})
countries = ['UK','China']
Implement in:
df[df.countries.isin(countries)]
Implement not in as an in over the remaining countries:
df[df.countries.isin([x for x in np.unique(df.countries) if x not in countries])]
A trick if you want to keep the order of the list:
df = pd.DataFrame({'country': ['US', 'UK', 'Germany', 'China']})
countries_to_keep = ['Germany', 'US']
# collect the matching row labels for each country, in list order
ind=[df.index[df['country']==i].tolist() for i in countries_to_keep]
flat_ind=[item for sublist in ind for item in sublist]
df.loc[flat_ind]
   country
2  Germany
0       US
My 2c worth:
I needed a combination of in and if/else statements for a dataframe, and this worked for me.
sale_method = pd.DataFrame(model_data["Sale Method"].str.upper())
sale_method["sale_classification"] = np.where(
    sale_method["Sale Method"].isin(["PRIVATE"]),
    "private",  # label for the isin branch (assumed; this line is missing from the source)
    np.where(
        sale_method["Sale Method"].str.contains("AUCTION"), "auction", "other"
    ),
)
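The same "in plus if/else" classification can also be written without nesting by using `np.select`, which is a swapped-in alternative to the nested `np.where` call, not the answer's own method. In this sketch the `model_data` values and the `"private"` label are hypothetical stand-ins.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the answer's model_data frame.
model_data = pd.DataFrame(
    {"Sale Method": ["Private", "Auction", "auction sale", "Tender"]}
)
sale_method = pd.DataFrame(model_data["Sale Method"].str.upper())

# Conditions are checked in order; the first one that matches wins.
conditions = [
    sale_method["Sale Method"].isin(["PRIVATE"]),
    sale_method["Sale Method"].str.contains("AUCTION"),
]
choices = ["private", "auction"]

# Rows matching no condition get the default label.
sale_method["sale_classification"] = np.select(conditions, choices, default="other")
print(sale_method)
```

Adding another category then only means appending one condition and one label to the two lists.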

How to convert a column of a dataframe from char to ascii integers? [Pandas]

I have a dataframe in which one columns called 'label' holds values like 'b', 'm', 'n' etc.
I want 'label' to instead hold the ascii equivalent of the letter.
How do I do it?
In [81]:
df = pd.DataFrame({'label':list('bmn')})
0 b
1 m
2 n
In [82]:
df['ascii'] = df['label'].apply(ord)
label ascii
0 b 98
1 m 109
2 n 110
It may be quicker to do a list comprehension:
In [83]:
df['ascii'] = [ord(x) for x in df['label']]
label ascii
0 b 98
1 m 109
2 n 110
You can also use map:
In [85]:
df['ascii'] = df['label'].map(ord)
label ascii
0 b 98
1 m 109
2 n 110
for a small df:
In [87]:
%timeit [ord(x) for x in df['label']]
%timeit df['label'].map(ord)
%timeit df['label'].apply(ord)
100000 loops, best of 3: 14 µs per loop
10000 loops, best of 3: 123 µs per loop
10000 loops, best of 3: 146 µs per loop
For a 3K df:
In [89]:
%timeit [ord(x) for x in df['label']]
%timeit df['label'].map(ord)
%timeit df['label'].apply(ord)
1000 loops, best of 3: 246 µs per loop
1000 loops, best of 3: 1 ms per loop
1000 loops, best of 3: 1.02 ms per loop
So here the list comprehension scales better than the other methods.
Python's built-in ord() returns the code point of a character. For example, "a" is 97 in ASCII, so
print(ord("a"))
prints 97.

Pandas Set Top Row as MultiIndex Level 1

Given the following data frame:
Item other
0 items others
1 y bb
2 z cc
3 x dd
I'd like to create a multiindexed set of headers such that the current headers become level 0 and the current top row becomes level 1.
Thanks in advance!
Another solution is to create the columns with MultiIndex.from_tuples:
cols = list(zip(d2.columns, d2.iloc[0,:]))
c1 = pd.MultiIndex.from_tuples(cols, names=[None, 0])
print (pd.DataFrame(data=d2[1:].values, columns=c1, index=d2.index[1:]))
Item other
0 items others
1 y bb
2 z cc
3 x dd
Or if column names are not important:
cols = list(zip(d2.columns, d2.iloc[0,:]))
d2.columns = pd.MultiIndex.from_tuples(cols)
print (d2[1:])
Item other
items others
1 y bb
2 z cc
3 x dd
In [63]: %timeit jez(d22)
100 loops, best of 3: 6.22 ms per loop
In [64]: %timeit piR(d2)
10 loops, best of 3: 84.9 ms per loop
In [70]: %timeit jez(d22)
The slowest run took 4.61 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 941 µs per loop
In [71]: %timeit piR(d2)
The slowest run took 4.44 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 1.36 ms per loop
import pandas as pd
# Sample frame from the question: the first row holds the second header level
d2 = pd.DataFrame({'Item': ['items', 'y', 'z', 'x'],
                   'other': ['others', 'bb', 'cc', 'dd']})
print (d2)
d2 = pd.concat([d2]*100000).reset_index(drop=True)
#d2 = pd.concat([d2]*10).reset_index(drop=True)
d22 = d2.copy()
def piR(d2):
    return (d2.T.set_index(0, append=True).T)
def jez(d2):
    cols = list(zip(d2.columns, d2.iloc[0,:]))
    c1 = pd.MultiIndex.from_tuples(cols, names=[None, 0])
    return pd.DataFrame(data=d2[1:].values, columns=c1, index=d2.index[1:])
print (piR(d2))
print (jez(d22))
print ((piR(d2) == jez(d22)).all())
Item items True
other others True
dtype: bool
Transpose the DataFrame, call set_index on the first column with append=True, then transpose back.
d2.T.set_index(0, append=True).T
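To make the header promotion easy to try, here is a minimal end-to-end sketch of the `MultiIndex.from_tuples` variant. The frame is rebuilt from the table in the question, and `result` is an illustrative name.

```python
import pandas as pd

# Frame from the question: row 0 holds the labels for the second header level.
d2 = pd.DataFrame({
    'Item': ['items', 'y', 'z', 'x'],
    'other': ['others', 'bb', 'cc', 'dd'],
})

# Pair each current column name with the value in the first row.
cols = list(zip(d2.columns, d2.iloc[0]))

# Drop the first row and install the two-level header.
result = d2.iloc[1:].copy()
result.columns = pd.MultiIndex.from_tuples(cols)
print(result)
```

Copying the slice before assigning the new columns leaves the original `d2` unchanged.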
