Sub-Totals & Total In Pandas [duplicate] - python-3.x

This is obviously simple, but as a NumPy newbie I'm getting stuck.
I have a CSV file that contains 3 columns, the State, the Office ID, and the Sales for that office.
I want to calculate the percentage of sales per office in a given state (total of all percentages in each state is 100%).
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
'office_id': list(range(1, 7)) * 2,
'sales': [np.random.randint(100000, 999999)
for _ in range(12)]})
df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
This returns:
sales
state office_id
AZ 2 839507
4 373917
6 347225
CA 1 798585
3 890850
5 454423
CO 1 819975
3 202969
5 614011
WA 2 163942
4 369858
6 959285
I can't seem to figure out how to "reach up" to the state level of the groupby to total up the sales for the entire state to calculate the fraction.

Update 2022-03
This answer by caner using transform looks much better than my original answer!
df['sales'] / df.groupby('state')['sales'].transform('sum')
Thanks to this comment by Paul Rougieux for surfacing it.
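As a minimal, self-contained sketch (reusing the seeded setup from the original answer below; the column name sales_pct is my own), the same transform result can be stored as a percentage column:
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
                   'office_id': list(range(1, 7)) * 2,
                   'sales': [np.random.randint(100000, 999999) for _ in range(12)]})
# Per-office share of its state's total, expressed as a percentage
df['sales_pct'] = 100 * df['sales'] / df.groupby('state')['sales'].transform('sum')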
Original Answer (2014)
Paul H's answer is right that you will have to make a second groupby object, but you can calculate the percentage in a simpler way -- just groupby the state_office and divide the sales column by its sum. Copying the beginning of Paul H's answer:
# From Paul H
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
'office_id': list(range(1, 7)) * 2,
'sales': [np.random.randint(100000, 999999)
for _ in range(12)]})
state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
# Change: groupby state_office and divide by sum
state_pcts = state_office.groupby(level=0).apply(lambda x:
100 * x / float(x.sum()))
Returns:
sales
state office_id
AZ 2 16.981365
4 19.250033
6 63.768601
CA 1 19.331879
3 33.858747
5 46.809373
CO 1 36.851857
3 19.874290
5 43.273852
WA 2 34.707233
4 35.511259
6 29.781508

(This solution is inspired by this article: https://pbpython.com/pandas_transform.html)
I find the following solution to be the simplest (and probably the fastest), using transformation:
Transformation: While aggregation must return a reduced version of the
data, transformation can return some transformed version of the full
data to recombine. For such a transformation, the output is the same
shape as the input.
So using transformation, the solution is a one-liner:
df['%'] = 100 * df['sales'] / df.groupby('state')['sales'].transform('sum')
And if you print:
print(df.sort_values(['state', 'office_id']).reset_index(drop=True))
state office_id sales %
0 AZ 2 195197 9.844309
1 AZ 4 877890 44.274352
2 AZ 6 909754 45.881339
3 CA 1 614752 50.415708
4 CA 3 395340 32.421767
5 CA 5 209274 17.162525
6 CO 1 549430 42.659629
7 CO 3 457514 35.522956
8 CO 5 280995 21.817415
9 WA 2 828238 35.696929
10 WA 4 719366 31.004563
11 WA 6 772590 33.298509

You need to make a second groupby object that groups by the states, and then use the div method:
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
'office_id': list(range(1, 7)) * 2,
'sales': [np.random.randint(100000, 999999) for _ in range(12)]})
state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
state = df.groupby(['state']).agg({'sales': 'sum'})
state_office.div(state, level='state') * 100
sales
state office_id
AZ 2 16.981365
4 19.250033
6 63.768601
CA 1 19.331879
3 33.858747
5 46.809373
CO 1 36.851857
3 19.874290
5 43.273852
WA 2 34.707233
4 35.511259
6 29.781508
The level='state' kwarg in div tells pandas to broadcast/join the dataframes based on the values in the state level of the index.
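To see what that alignment does in isolation, here is a tiny hand-built sketch (the numbers are made up purely for illustration):
import pandas as pd
small = pd.DataFrame({'sales': [10, 30, 20, 60]},
                     index=pd.MultiIndex.from_tuples(
                         [('AZ', 1), ('AZ', 2), ('CA', 1), ('CA', 2)],
                         names=['state', 'office_id']))
totals = small.groupby(level='state').sum()   # indexed by state only
# div with level='state' broadcasts each state total across its office_id rows
print(small.div(totals, level='state') * 100)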

For conciseness I'd use the SeriesGroupBy:
In [11]: c = df.groupby(['state', 'office_id'])['sales'].sum().rename("count")
In [12]: c
Out[12]:
state office_id
AZ 2 925105
4 592852
6 362198
CA 1 819164
3 743055
5 292885
CO 1 525994
3 338378
5 490335
WA 2 623380
4 441560
6 451428
Name: count, dtype: int64
In [13]: c / c.groupby(level=0).sum()
Out[13]:
state office_id
AZ 2 0.492037
4 0.315321
6 0.192643
CA 1 0.441573
3 0.400546
5 0.157881
CO 1 0.388271
3 0.249779
5 0.361949
WA 2 0.411101
4 0.291196
6 0.297703
Name: count, dtype: float64
For multiple groups you have to use transform (using Radical's df):
In [21]: c = df.groupby(["Group 1","Group 2","Final Group"])["Numbers I want as percents"].sum().rename("count")
In [22]: c / c.groupby(level=[0, 1]).transform("sum")
Out[22]:
Group 1 Group 2 Final Group
AAHQ BOSC OWON 0.331006
TLAM 0.668994
MQVF BWSI 0.288961
FXZM 0.711039
ODWV NFCH 0.262395
...
Name: count, dtype: float64
This seems to be slightly more performant than the other answers (just less than twice the speed of Radical's answer, for me ~0.08s).

I think this needs benchmarking. Using OP's original DataFrame,
df = pd.DataFrame({
'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
'office_id': list(range(1, 7)) * 2,
'sales': [np.random.randint(100000, 999999) for _ in range(12)]
})
0th Caner
NEW Pandas transform looks much faster.
df['sales'] / df.groupby('state')['sales'].transform('sum')
1.32 ms ± 352 µs per loop
(mean ± std. dev. of 7 runs, 100 loops each)
1st Andy Hayden
As commented on his answer, Andy takes full advantage of vectorisation and pandas indexing.
c = df.groupby(['state', 'office_id'])['sales'].sum().rename("count")
c / c.groupby(level=0).sum()
3.42 ms ± 16.7 µs per loop
(mean ± std. dev. of 7 runs, 100 loops each)
2nd Paul H
state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
state = df.groupby(['state']).agg({'sales': 'sum'})
state_office.div(state, level='state') * 100
4.66 ms ± 24.4 µs per loop
(mean ± std. dev. of 7 runs, 100 loops each)
3rd exp1orer
This is the slowest answer as it calculates x.sum() for each x in level 0.
For me, this is still a useful answer, though not in its current form. For quick EDA on smaller datasets, apply allows you to use method chaining to write this in a single line. We therefore remove the need to decide on a variable's name, which is actually very computationally expensive for your most valuable resource (your brain!!).
Here is the modification,
(
df.groupby(['state', 'office_id'])
.agg({'sales': 'sum'})
.groupby(level=0)
.apply(lambda x: 100 * x / float(x.sum()))
)
10.6 ms ± 81.5 µs per loop
(mean ± std. dev. of 7 runs, 100 loops each)
So no one is going to care about 6 ms on a small dataset. However, this is a 3x speed-up and, on a larger dataset with high-cardinality groupbys, it is going to make a massive difference.
Adding to the above code, we make a DataFrame with shape (12,000,000, 3) with 14412 state categories and 600 office_ids,
import string
import numpy as np
import pandas as pd
np.random.seed(0)
groups = [
''.join(i) for i in zip(
np.random.choice(np.array([i for i in string.ascii_lowercase]), 30000),
np.random.choice(np.array([i for i in string.ascii_lowercase]), 30000),
np.random.choice(np.array([i for i in string.ascii_lowercase]), 30000),
)
]
df = pd.DataFrame({'state': groups * 400,
'office_id': list(range(1, 601)) * 20000,
'sales': [np.random.randint(100000, 999999)
for _ in range(12)] * 1000000
})
Using Caner's,
0.791 s ± 19.4 ms per loop
(mean ± std. dev. of 7 runs, 1 loop each)
Using Andy's,
2 s ± 10.4 ms per loop
(mean ± std. dev. of 7 runs, 1 loop each)
and exp1orer
19 s ± 77.1 ms per loop
(mean ± std. dev. of 7 runs, 1 loop each)
So now we see a 10x speed-up on large, high-cardinality datasets with Andy's, but a very impressive 20x speed-up with Caner's.
Be sure to upvote these three answers if you upvote this one!
Edit: added Caner benchmark

I realize there are already good answers here.
I nevertheless would like to contribute my own, because I feel for an elementary, simple question like this, there should be a short solution that is understandable at a glance.
It should also work in a way that I can add the percentages as a new column, leaving the rest of the dataframe untouched. Last but not least, it should generalize in an obvious way to the case in which there is more than one grouping level (e.g., state and country instead of only state).
The following snippet fulfills these criteria:
df['sales_ratio'] = df.groupby(['state'])['sales'].transform(lambda x: x/x.sum())
Note that if you're still using Python 2, you'll have to replace the x in the denominator of the lambda term by float(x).
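To illustrate the multi-level case mentioned above, here is a hedged sketch with a hypothetical country column (the column and its values are invented purely for illustration):
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'country': ['US'] * 6 + ['MX'] * 6,   # hypothetical extra grouping level
                   'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
                   'office_id': list(range(1, 7)) * 2,
                   'sales': [np.random.randint(100000, 999999) for _ in range(12)]})
df['sales_ratio'] = df.groupby(['country', 'state'])['sales'].transform(lambda x: x / x.sum())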

I know that this is an old question, but exp1orer's answer is very slow for datasets with a large number of unique groups (probably because of the lambda). I built off of their answer to turn it into an array calculation, so now it's super fast! Below is the example code:
Create the test dataframe with 50,000 unique groups
import random
import string
import pandas as pd
import numpy as np
np.random.seed(0)
# This is the total number of groups to be created
NumberOfGroups = 50000
# Create a lot of groups (random strings of 4 letters)
Group1 = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups // 10)] * 10
Group2 = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups // 2)] * 2
FinalGroup = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups)]
# Make the numbers
NumbersForPercents = [np.random.randint(100, 999) for _ in range(NumberOfGroups)]
# Make the dataframe
df = pd.DataFrame({'Group 1': Group1,
'Group 2': Group2,
'Final Group': FinalGroup,
'Numbers I want as percents': NumbersForPercents})
When grouped it looks like:
Numbers I want as percents
Group 1 Group 2 Final Group
AAAH AQYR RMCH 847
XDCL 182
DQGO ALVF 132
AVPH 894
OVGH NVOO 650
VKQP 857
VNLY HYFW 884
MOYH 469
XOOC GIDS 168
HTOY 544
AACE HNXU RAXK 243
YZNK 750
NOYI NYGC 399
ZYCI 614
QKGK CRLF 520
UXNA 970
TXAR MLNB 356
NMFJ 904
VQYG NPON 504
QPKQ 948
...
[50000 rows x 1 columns]
Array method of finding percentage:
# Initial grouping (basically a sorted version of df)
PreGroupby_df = df.groupby(["Group 1","Group 2","Final Group"]).agg({'Numbers I want as percents': 'sum'}).reset_index()
# Get the sum of values for the "final group", append "_Sum" to its column name, and change it into a dataframe (.reset_index)
SumGroup_df = df.groupby(["Group 1","Group 2"]).agg({'Numbers I want as percents': 'sum'}).add_suffix('_Sum').reset_index()
# Merge the two dataframes
Percents_df = pd.merge(PreGroupby_df, SumGroup_df)
# Divide the two columns
Percents_df["Percent of Final Group"] = Percents_df["Numbers I want as percents"] / Percents_df["Numbers I want as percents_Sum"] * 100
# Drop the extra _Sum column
Percents_df.drop(["Numbers I want as percents_Sum"], inplace=True, axis=1)
This method takes about ~0.15 seconds
Top answer method (using lambda function):
state_office = df.groupby(['Group 1','Group 2','Final Group']).agg({'Numbers I want as percents': 'sum'})
state_pcts = state_office.groupby(level=['Group 1','Group 2']).apply(lambda x: 100 * x / float(x.sum()))
This method takes about ~21 seconds to produce the same result.
The result:
Group 1 Group 2 Final Group Numbers I want as percents Percent of Final Group
0 AAAH AQYR RMCH 847 82.312925
1 AAAH AQYR XDCL 182 17.687075
2 AAAH DQGO ALVF 132 12.865497
3 AAAH DQGO AVPH 894 87.134503
4 AAAH OVGH NVOO 650 43.132050
5 AAAH OVGH VKQP 857 56.867950
6 AAAH VNLY HYFW 884 65.336290
7 AAAH VNLY MOYH 469 34.663710
8 AAAH XOOC GIDS 168 23.595506
9 AAAH XOOC HTOY 544 76.404494

The most elegant way to find percentages across columns or index is to use pd.crosstab.
Sample Data
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
'office_id': list(range(1, 7)) * 2,
'sales': [np.random.randint(100000, 999999) for _ in range(12)]})
The output dataframe is like this
print(df)
state office_id sales
0 CA 1 764505
1 WA 2 313980
2 CO 3 558645
3 AZ 4 883433
4 CA 5 301244
5 WA 6 752009
6 CO 1 457208
7 AZ 2 259657
8 CA 3 584471
9 WA 4 122358
10 CO 5 721845
11 AZ 6 136928
Just specify the index, columns and the values to aggregate. The normalize keyword will calculate % across index or columns depending upon the context.
result = pd.crosstab(index=df['state'],
columns=df['office_id'],
values=df['sales'],
aggfunc='sum',
normalize='index').applymap('{:.2f}%'.format)
print(result)
office_id 1 2 3 4 5 6
state
AZ 0.00% 0.20% 0.00% 0.69% 0.00% 0.11%
CA 0.46% 0.00% 0.35% 0.00% 0.18% 0.00%
CO 0.26% 0.00% 0.32% 0.00% 0.42% 0.00%
WA 0.00% 0.26% 0.00% 0.10% 0.00% 0.63%
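If you want true percentages (each row summing to 100) rather than the fractions shown above, a hedged variant is to scale before formatting; fillna(0) covers state/office combinations that never occur:
result = (pd.crosstab(index=df['state'],
                      columns=df['office_id'],
                      values=df['sales'],
                      aggfunc='sum',
                      normalize='index')
          .fillna(0) * 100).round(2)
print(result)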

You can sum the whole DataFrame and divide by the state total:
# Copying setup from Paul H answer
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
'office_id': list(range(1, 7)) * 2,
'sales': [np.random.randint(100000, 999999) for _ in range(12)]})
# Add a column with the sales divided by state total sales.
df['sales_ratio'] = (df / df.groupby(['state']).transform(sum))['sales']
df
Returns
office_id sales state sales_ratio
0 1 405711 CA 0.193319
1 2 535829 WA 0.347072
2 3 217952 CO 0.198743
3 4 252315 AZ 0.192500
4 5 982371 CA 0.468094
5 6 459783 WA 0.297815
6 1 404137 CO 0.368519
7 2 222579 AZ 0.169814
8 3 710581 CA 0.338587
9 4 548242 WA 0.355113
10 5 474564 CO 0.432739
11 6 835831 AZ 0.637686
But note that this only works because all columns other than state are numeric, enabling summation of the entire DataFrame. For example, if office_id is character instead, you get an error:
df.office_id = df.office_id.astype(str)
df['sales_ratio'] = (df / df.groupby(['state']).transform(sum))['sales']
TypeError: unsupported operand type(s) for /: 'str' and 'str'
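A hedged workaround is to restrict the division to the numeric column, so the string office_id never takes part in the arithmetic:
df['sales_ratio'] = df['sales'] / df.groupby('state')['sales'].transform('sum')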

import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
'office_id': list(range(1, 7)) * 2,
'sales': [np.random.randint(100000, 999999) for _ in range(12)]})
df = df.groupby(['state', 'office_id'])['sales'].sum().rename("weightage").groupby(level=0).transform(lambda x: x / x.sum())
df = df.reset_index()
Output:
state office_id weightage
0 AZ 2 0.169814
1 AZ 4 0.192500
2 AZ 6 0.637686
3 CA 1 0.193319
4 CA 3 0.338587
5 CA 5 0.468094
6 CO 1 0.368519
7 CO 3 0.198743
8 CO 5 0.432739
9 WA 2 0.347072
10 WA 4 0.355113
11 WA 6 0.297815

I think this would do the trick in 1 line:
df.groupby(['state', 'office_id']).sum().transform(lambda x: x/np.sum(x)*100)
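Note that np.sum(x) here is the grand total over all rows, so these percentages sum to 100 across the entire frame rather than within each state. A hedged per-state variant would group the aggregated result by the state level first:
(df.groupby(['state', 'office_id']).sum()
   .groupby(level=0)
   .transform(lambda x: 100 * x / x.sum()))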

A simple way I have used is a merge after the two groupbys, then a simple division.
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
'office_id': list(range(1, 7)) * 2,
'sales': [np.random.randint(100000, 999999) for _ in range(12)]})
state_office = df.groupby(['state', 'office_id'])['sales'].sum().reset_index()
state = df.groupby(['state'])['sales'].sum().reset_index()
state_office = state_office.merge(state, left_on='state', right_on ='state', how = 'left')
state_office['sales_ratio'] = 100*(state_office['sales_x']/state_office['sales_y'])
state office_id sales_x sales_y sales_ratio
0 AZ 2 222579 1310725 16.981365
1 AZ 4 252315 1310725 19.250033
2 AZ 6 835831 1310725 63.768601
3 CA 1 405711 2098663 19.331879
4 CA 3 710581 2098663 33.858747
5 CA 5 982371 2098663 46.809373
6 CO 1 404137 1096653 36.851857
7 CO 3 217952 1096653 19.874290
8 CO 5 474564 1096653 43.273852
9 WA 2 535829 1543854 34.707233
10 WA 4 548242 1543854 35.511259
11 WA 6 459783 1543854 29.781508

df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
'office_id': list(range(1, 7)) * 2,
'sales': [np.random.randint(100000, 999999)
for _ in range(12)]})
grouped = df.groupby(['state', 'office_id'])
100*grouped.sum()/df[["state","sales"]].groupby('state').sum()
Returns:
sales
state office_id
AZ 2 54.587910
4 33.009225
6 12.402865
CA 1 32.046582
3 44.937684
5 23.015735
CO 1 21.099989
3 31.848658
5 47.051353
WA 2 43.882790
4 10.265275
6 45.851935

As someone who is also learning pandas, I found the other answers a bit implicit, as pandas hides most of the work behind the scenes, namely in how the operation automatically matches up column and index names. This code should be equivalent to a step-by-step version of exp1orer's accepted answer.
With the df, I'll call it by the alias state_office_sales:
sales
state office_id
AZ 2 839507
4 373917
6 347225
CA 1 798585
3 890850
5 454423
CO 1 819975
3 202969
5 614011
WA 2 163942
4 369858
6 959285
state_total_sales is state_office_sales summed over index level 0 (the leftmost level):
In: state_total_sales = state_office_sales.groupby(level=0).sum()
state_total_sales
Out:
sales
state
AZ 2448009
CA 2832270
CO 1495486
WA 595859
Because the two dataframes share an index name and a column name, pandas will find the appropriate locations through the shared indexes:
In: state_office_sales / state_total_sales
Out:
sales
state office_id
AZ 2 0.448640
4 0.125865
6 0.425496
CA 1 0.288022
3 0.322169
5 0.389809
CO 1 0.206684
3 0.357891
5 0.435425
WA 2 0.321689
4 0.346325
6 0.331986
To illustrate this even better, here is a partial total with an XX state that has no equivalent. Pandas will match locations based on index and column names; where there is no overlap, pandas will ignore it:
In: partial_total = pd.DataFrame(
data = {'sales' : [2448009, 595859, 99999]},
index = ['AZ', 'WA', 'XX' ]
)
partial_total.index.name = 'state'
Out:
sales
state
AZ 2448009
WA 595859
XX 99999
In: state_office_sales / partial_total
Out:
sales
state office_id
AZ 2 0.448640
4 0.125865
6 0.425496
CA 1 NaN
3 NaN
5 NaN
CO 1 NaN
3 NaN
5 NaN
WA 2 0.321689
4 0.346325
6 0.331986
This becomes very clear when there are no shared indexes or columns. Here missing_index_totals is equal to state_total_sales except that it has no index name.
In: missing_index_totals = state_total_sales.rename_axis("")
missing_index_totals
Out:
sales
AZ 2448009
CA 2832270
CO 1495486
WA 595859
In: state_office_sales / missing_index_totals
Out: ValueError: cannot join with no overlapping index names
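A minimal fix sketch: restoring the index name (here to 'state', matching the left-hand frame) makes the alignment work again:
state_office_sales / missing_index_totals.rename_axis('state')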

df.groupby('state').office_id.value_counts(normalize = True)
I used the value_counts method, but it returns proportions like 0.70 and 0.30, not 70 and 30.
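To display 70 and 30 instead, a minimal sketch is simply to scale the result (keep in mind this counts offices per state rather than weighting by sales):
100 * df.groupby('state').office_id.value_counts(normalize=True)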

One-line solution:
df.join(
df.groupby('state').agg(state_total=('sales', 'sum')),
on='state'
).eval('sales / state_total')
This returns a Series of per-office ratios; it can be used on its own or assigned to the original DataFrame.
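For example, to attach the result as a new column (sales_ratio is my own name for it):
df['sales_ratio'] = df.join(
    df.groupby('state').agg(state_total=('sales', 'sum')),
    on='state'
).eval('sales / state_total')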

Related

Finding the largest (N) proportion of percentage in pandas dataframe

Suppose I have the following df:
df = pd.DataFrame({'name':['Sara', 'John', 'Christine','Paul', 'Jo', 'Zack','Chris', 'Mathew', 'Suzan'],
'visits': [0, 0, 1,2, 3, 9,6, 10, 3]})
df
looks like:
name visits
0 Sara 0
1 John 0
2 Christine 1
3 Paul 2
4 Jo 3
5 Zack 9
6 Chris 6
7 Mathew 10
8 Suzan 3
I did some lines of code to get the percentage of visits per name and sort them descending:
df['percent'] = (df['visits'] / np.sum(df['visits']))
df.sort_values(by='percent', ascending=False).reset_index(drop=True)
Now I have got the percent of visits to total visits by all names:
name visits percent
0 Mathew 10 0.294118
1 Zack 9 0.264706
2 Chris 6 0.176471
3 Jo 3 0.088235
4 Suzan 3 0.088235
5 Paul 2 0.058824
6 Christine 1 0.029412
7 Sara 0 0.000000
8 John 0 0.000000
What I need to get is the largest proportion of names with the highest percentage. For example, the first 3 rows represent ~73% of the total visits, and the others could be neglected compared to the sum of the percentages of the first 3 rows.
I know I can select the top 3 by using nlargest:
df.nlargest(3, 'percent')
But there is high variability in the data and the largest proportion could be the first 2 or 3 rows or even more.
EDIT:
How can I automatically find the largest (N) proportion of the percentages out of the total number of rows?
You have to define outliers in some way. One way is to use scipy.stats.zscore like in this answer:
import pandas as pd
import numpy as np
from scipy import stats
df = pd.DataFrame({'name':['Sara', 'John', 'Christine','Paul', 'Jo', 'Zack','Chris', 'Mathew', 'Suzan'],
'visits': [0, 0, 1,2, 3, 9,6, 10, 3]})
df['percent'] = (df['visits'] / np.sum(df['visits']))
df.loc[df['percent'][stats.zscore(df['percent']) > 0.6].index]
which prints
name visits percent
5 Zack 9 0.264706
6 Chris 6 0.176471
7 Mathew 10 0.294118

Python 3 Pandas fast lookup in dictionary for column

I have a Pandas DataFrame where I need to add new columns of data from lookup dictionaries. I am looking for the fastest way to do this. I have a way that works using DataFrame.map() with a lambda, but I wanted to know if this was the best practice and the best performance I could achieve. I am used to working with R and the excellent data.table library. I am working in a Jupyter notebook, which is what is letting me use %time on the final line.
Here is what I have:
import numpy as np
import pandas as pd
np.random.seed(123)
num_samples = 100_000_000
ids = np.arange(0, num_samples)
states = ['Oregon', 'Michigan']
cities = ['Portland', 'Detroit']
state_data = {
0:{'Name': 'Oregon', 'mean': 100, 'std_dev': 5},
1:{'Name': 'Michigan', 'mean':90, 'std_dev': 8}
}
city_data = {
0:{'Name': 'Portland', 'mean': 8, 'std_dev':3},
1:{'Name': 'Detroit','mean': 4, 'std_dev':3}
}
state_df = pd.DataFrame.from_dict(state_data,orient='index')
print(state_df)
city_df = pd.DataFrame.from_dict(city_data,orient='index')
print(city_df)
sample_df = pd.DataFrame({'id':ids})
sample_df['state_id'] = np.random.randint(0, 2, num_samples)
sample_df['city_id'] = np.random.randint(0, 2, num_samples)
%time sample_df['state_mean'] = sample_df['state_id'].map(state_data).map(lambda x : x['mean'])
The last line is what I am most focused on.
I have also tried the following but saw no significant performance difference:
%time sample_df['state_mean'] = sample_df['state_id'].map(lambda x : state_data[x]['mean'])
What I ultimately want is to get sample_df to have columns for each of the states and cities. So I would have the following columns in the table:
id | state | state_mean | state_std_dev | city | city_mean | city_std_dev
Use DataFrame.join if you want to add all columns:
sample_df = sample_df.join(state_df,on = 'state_id')
# id state_id city_id Name mean std_dev
#0 0 0 0 Oregon 100 5
#1 1 1 1 Michigan 90 8
#2 2 0 0 Oregon 100 5
#3 3 0 0 Oregon 100 5
#4 4 0 0 Oregon 100 5
#... ... ... ... ... ... ...
#9995 9995 1 0 Michigan 90 8
#9996 9996 1 1 Michigan 90 8
#9997 9997 0 1 Oregon 100 5
#9998 9998 1 1 Michigan 90 8
#9999 9999 1 0 Michigan 90 8
For one column:
sample_df['state_mean'] = sample_df['state_id'].map(state_df['mean'])
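To get all of the state and city columns that the question asks for in one go, here is a hedged sketch built on the same join idea (the add_prefix names are my own choice):
sample_df = (sample_df
             .join(state_df.add_prefix('state_'), on='state_id')
             .join(city_df.add_prefix('city_'), on='city_id'))
# Columns now include state_Name, state_mean, state_std_dev,
# city_Name, city_mean and city_std_dev.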

Binning with pd.Cut Beyond range(replacing Nan with "<min_val" or ">Max_val" )

df= pd.DataFrame({'days': [0,31,45,35,19,70,80 ]})
df['range'] = pd.cut(df.days, [0,30,60])
df
Here the code is reproduced, where pd.cut is used to convert a numerical column to a categorical column. pd.cut assigns categories according to the list of bin edges passed, [0,30,60]. Rows 0, 5 and 6 are categorized as NaN because their values fall outside [0,30,60]. What I want is for 0 to be categorized as <0, and for 70 and 80 to be categorized as >60. If possible, I would also like dynamic text labels (A, B, C, D, E) depending on the number of categories created.
For the first part, adding -np.inf and np.inf to the bins will ensure that everything gets a bin:
In [5]: df= pd.DataFrame({'days': [0,31,45,35,19,70,80]})
...: df['range'] = pd.cut(df.days, [-np.inf, 0, 30, 60, np.inf])
...: df
...:
Out[5]:
days range
0 0 (-inf, 0.0]
1 31 (30.0, 60.0]
2 45 (30.0, 60.0]
3 35 (30.0, 60.0]
4 19 (0.0, 30.0]
5 70 (60.0, inf]
6 80 (60.0, inf]
For the second, you can use .cat.codes to get the bin index and do some tweaking from there:
In [8]: df['range'].cat.codes.apply(lambda x: chr(x + ord('A')))
Out[8]:
0 A
1 C
2 C
3 C
4 B
5 D
6 D
dtype: object
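If plain text labels are preferred over interval objects, a hedged sketch (the label strings are my own choice) passes labels= to pd.cut; the number of labels must match the number of bins:
import numpy as np
import pandas as pd
df = pd.DataFrame({'days': [0, 31, 45, 35, 19, 70, 80]})
df['range'] = pd.cut(df.days, [-np.inf, 0, 30, 60, np.inf],
                     labels=['<=0', '0-30', '30-60', '>60'])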

Is a KeyError in a dataframe caused by incorrect groupby application?

I have made a summarized dataframe from another dataframe using .groupby and .agg.
sum_df = cnms_df.groupby(['Tiermetric', 'Mod_unMod', 'Val_Combined', 'Det_Approx', 'State', 'Region', 'CO_FIPS']).agg({'MILES': 'sum'})
However, something doesn't look quite right; there seem to be missing values.
Tiermetric Mod_unMod Val_Combined Det_Approx State Region CO_FIPS MILES
Other 1 UnMapped ASSESSED Approx IN 5 18001 8.397255
18003 3.284817
18011 64.019156
18017 9.068318
TIER 4 Modernized VALID Detailed NC 4 37119 2.046716
NC 4 37120 59.890107
NC 4 37025 3.773599
When I try to do something like this:
sum_df['CO_FIPS'][0]
I get an error that seems related to indexing:
KeyError: 'CO_FIPS'
What I want is for my final dataframe to look like this:
Tiermetric Mod_unMod Val_Combined Det_Approx State Region CO_FIPS MILES
Other 1 UnMapped ASSESSED Approx IN 5 18001 8.397255
Other 1 UnMapped ASSESSED Approx IN 5 18003 3.284817
Other 1 UnMapped ASSESSED Approx IN 5 18011 64.019156
Other 1 UnMapped ASSESSED Approx IN 5 18017 9.068318
TIER 4 Modernized VALID Detailed NC 4 37119 2.046716
TIER 4 Modernized VALID Detailed NC 4 37120 59.890107
TIER 4 Modernized VALID Detailed NC 4 37025 3.773599
How can I fix this?
The groupby and sum cause all of these columns to become a multi-index. You can use reset_index() or pass in as_index=False to turn the index into columns.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'a':[1, 1, 2], 'b':[10, 20, 30]})
In [3]: df
Out[3]:
a b
0 1 10
1 1 20
2 2 30
In [4]: grouped = df.groupby('a').agg({'b': 'sum'})
In [5]: grouped # a is an index now
Out[5]:
b
a
1 30
2 30
In [6]: grouped = df.groupby('a', as_index=False).agg({'b': 'sum'})
In [7]: grouped # now a is a column
Out[7]:
a b
0 1 30
1 2 30
This will work with multi-indexes as well.
Set as_index to False (by default it is True):
sum_df = cnms_df.groupby(['Tiermetric', 'Mod_unMod', 'Val_Combined', 'Det_Approx', 'State', 'Region', 'CO_FIPS'], as_index=False).agg({'MILES': 'sum'})
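Equivalently, on the toy frame from above, you can aggregate first and then call reset_index() to turn the group keys back into columns; a small sketch:
import pandas as pd
df = pd.DataFrame({'a': [1, 1, 2], 'b': [10, 20, 30]})
grouped = df.groupby('a').agg({'b': 'sum'}).reset_index()
#    a   b
# 0  1  30
# 1  2  30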

Filter the DataFrame rows to only show movies with a 'duration' of at least 200 minutes [duplicate]

How can I select rows from a DataFrame based on values in some column in Pandas?
In SQL, I would use:
SELECT *
FROM table
WHERE column_name = some_value
To select rows whose column value equals a scalar, some_value, use ==:
df.loc[df['column_name'] == some_value]
To select rows whose column value is in an iterable, some_values, use isin:
df.loc[df['column_name'].isin(some_values)]
Combine multiple conditions with &:
df.loc[(df['column_name'] >= A) & (df['column_name'] <= B)]
Note the parentheses. Due to Python's operator precedence rules, & binds more tightly than <= and >=. Thus, the parentheses in the last example are necessary. Without the parentheses
df['column_name'] >= A & df['column_name'] <= B
is parsed as
df['column_name'] >= (A & df['column_name']) <= B
which results in a Truth value of a Series is ambiguous error.
To select rows whose column value does not equal some_value, use !=:
df.loc[df['column_name'] != some_value]
isin returns a boolean Series, so to select rows whose value is not in some_values, negate the boolean Series using ~:
df.loc[~df['column_name'].isin(some_values)]
For example,
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
'B': 'one one two three two two one three'.split(),
'C': np.arange(8), 'D': np.arange(8) * 2})
print(df)
# A B C D
# 0 foo one 0 0
# 1 bar one 1 2
# 2 foo two 2 4
# 3 bar three 3 6
# 4 foo two 4 8
# 5 bar two 5 10
# 6 foo one 6 12
# 7 foo three 7 14
print(df.loc[df['A'] == 'foo'])
yields
A B C D
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
If you have multiple values you want to include, put them in a
list (or more generally, any iterable) and use isin:
print(df.loc[df['B'].isin(['one','three'])])
yields
A B C D
0 foo one 0 0
1 bar one 1 2
3 bar three 3 6
6 foo one 6 12
7 foo three 7 14
Note, however, that if you wish to do this many times, it is more efficient to
make an index first, and then use df.loc:
df = df.set_index(['B'])
print(df.loc['one'])
yields
A C D
B
one foo 0 0
one bar 1 2
one foo 6 12
or, to include multiple values from the index use df.index.isin:
df.loc[df.index.isin(['one','two'])]
yields
A C D
B
one foo 0 0
one bar 1 2
two foo 2 4
two foo 4 8
two bar 5 10
one foo 6 12
There are several ways to select rows from a Pandas dataframe:
Boolean indexing (df[df['col'] == value] )
Positional indexing (df.iloc[...])
Label indexing (df.xs(...))
df.query(...) API
Below I show you examples of each, with advice on when to use certain techniques. Assume our criterion is column 'A' == 'foo'.
(Note on performance: For each base type, we can keep things simple by using the Pandas API or we can venture outside the API, usually into NumPy, and speed things up.)
Setup
The first thing we'll need is to identify a condition that will act as our criterion for selecting rows. We'll start with the OP's case column_name == some_value, and include some other common use cases.
Borrowing from @unutbu:
import pandas as pd, numpy as np
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
'B': 'one one two three two two one three'.split(),
'C': np.arange(8), 'D': np.arange(8) * 2})
1. Boolean indexing
... Boolean indexing requires finding the true value of each row's 'A' column being equal to 'foo', then using those truth values to identify which rows to keep. Typically, we'd name this series, an array of truth values, mask. We'll do so here as well.
mask = df['A'] == 'foo'
We can then use this mask to slice or index the data frame
df[mask]
A B C D
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
This is one of the simplest ways to accomplish this task and if performance or intuitiveness isn't an issue, this should be your chosen method. However, if performance is a concern, then you might want to consider an alternative way of creating the mask.
2. Positional indexing
Positional indexing (df.iloc[...]) has its use cases, but this isn't one of them. In order to identify where to slice, we first need to perform the same boolean analysis we did above. This leaves us performing one extra step to accomplish the same task.
mask = df['A'] == 'foo'
pos = np.flatnonzero(mask)
df.iloc[pos]
A B C D
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
3. Label indexing
Label indexing can be very handy, but in this case, we are again doing more work for no benefit
df.set_index('A', append=True, drop=False).xs('foo', level=1)
A B C D
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
4. df.query() API
pd.DataFrame.query is a very elegant/intuitive way to perform this task, but is often slower. However, if you pay attention to the timings below, for large data, the query is very efficient. More so than the standard approach and of similar magnitude as my best suggestion.
df.query('A == "foo"')
A B C D
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
My preference is to use the Boolean mask
Actual improvements can be made by modifying how we create our Boolean mask.
mask alternative 1
Use the underlying NumPy array and forgo the overhead of creating another pd.Series
mask = df['A'].values == 'foo'
I'll show more complete time tests at the end, but just take a look at the performance gains we get using the sample data frame. First, we look at the difference in creating the mask
%timeit mask = df['A'].values == 'foo'
%timeit mask = df['A'] == 'foo'
5.84 µs ± 195 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
166 µs ± 4.45 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Evaluating the mask with the NumPy array is ~ 30 times faster. This is partly due to NumPy evaluation often being faster. It is also partly due to the lack of overhead necessary to build an index and a corresponding pd.Series object.
Next, we'll look at the timing for slicing with one mask versus the other.
mask = df['A'].values == 'foo'
%timeit df[mask]
mask = df['A'] == 'foo'
%timeit df[mask]
219 µs ± 12.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
239 µs ± 7.03 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The performance gains aren't as pronounced. We'll see if this holds up over more robust testing.
mask alternative 2
We could have reconstructed the data frame as well. There is a big caveat when reconstructing a dataframe—you must take care of the dtypes when doing so!
Instead of df[mask] we will do this
pd.DataFrame(df.values[mask], df.index[mask], df.columns).astype(df.dtypes)
If the data frame is of mixed type, which our example is, then when we get df.values the resulting array is of dtype object and consequently, all columns of the new data frame will be of dtype object. Thus requiring the astype(df.dtypes) and killing any potential performance gains.
%timeit df[mask]
%timeit pd.DataFrame(df.values[mask], df.index[mask], df.columns).astype(df.dtypes)
216 µs ± 10.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.43 ms ± 39.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
However, if the data frame is not of mixed type, this is a very useful way to do it.
Given
np.random.seed([3,1415])
d1 = pd.DataFrame(np.random.randint(10, size=(10, 5)), columns=list('ABCDE'))
d1
A B C D E
0 0 2 7 3 8
1 7 0 6 8 6
2 0 2 0 4 9
3 7 3 2 4 3
4 3 6 7 7 4
5 5 3 7 5 9
6 8 7 6 4 7
7 6 2 6 6 5
8 2 8 7 5 8
9 4 7 6 1 5
%%timeit
mask = d1['A'].values == 7
d1[mask]
179 µs ± 8.73 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Versus
%%timeit
mask = d1['A'].values == 7
pd.DataFrame(d1.values[mask], d1.index[mask], d1.columns)
87 µs ± 5.12 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
We cut the time in half.
mask alternative 3
@unutbu also shows us how to use pd.Series.isin to account for each element of df['A'] being in a set of values. This evaluates to the same thing if our set of values is a set of one value, namely 'foo'. But it also generalizes to include larger sets of values if needed. Turns out, this is still pretty fast even though it is a more general solution. The only real loss is in intuitiveness for those not familiar with the concept.
mask = df['A'].isin(['foo'])
df[mask]
A B C D
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
However, as before, we can utilize NumPy to improve performance while sacrificing virtually nothing. We'll use np.in1d
mask = np.in1d(df['A'].values, ['foo'])
df[mask]
A B C D
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
Timing
I'll include other concepts mentioned in other posts as well for reference.
Code Below
Each column in this table represents a different-length data frame over which we test each function. Each cell shows the relative time taken, with the fastest function in each column given a base index of 1.0.
res.div(res.min())
10 30 100 300 1000 3000 10000 30000
mask_standard 2.156872 1.850663 2.034149 2.166312 2.164541 3.090372 2.981326 3.131151
mask_standard_loc 1.879035 1.782366 1.988823 2.338112 2.361391 3.036131 2.998112 2.990103
mask_with_values 1.010166 1.000000 1.005113 1.026363 1.028698 1.293741 1.007824 1.016919
mask_with_values_loc 1.196843 1.300228 1.000000 1.000000 1.038989 1.219233 1.037020 1.000000
query 4.997304 4.765554 5.934096 4.500559 2.997924 2.397013 1.680447 1.398190
xs_label 4.124597 4.272363 5.596152 4.295331 4.676591 5.710680 6.032809 8.950255
mask_with_isin 1.674055 1.679935 1.847972 1.724183 1.345111 1.405231 1.253554 1.264760
mask_with_in1d 1.000000 1.083807 1.220493 1.101929 1.000000 1.000000 1.000000 1.144175
You'll notice that the fastest times seem to be shared between mask_with_values and mask_with_in1d.
res.T.plot(loglog=True)
Functions
def mask_standard(df):
    mask = df['A'] == 'foo'
    return df[mask]

def mask_standard_loc(df):
    mask = df['A'] == 'foo'
    return df.loc[mask]

def mask_with_values(df):
    mask = df['A'].values == 'foo'
    return df[mask]

def mask_with_values_loc(df):
    mask = df['A'].values == 'foo'
    return df.loc[mask]

def query(df):
    return df.query('A == "foo"')

def xs_label(df):
    return df.set_index('A', append=True, drop=False).xs('foo', level=-1)

def mask_with_isin(df):
    mask = df['A'].isin(['foo'])
    return df[mask]

def mask_with_in1d(df):
    mask = np.in1d(df['A'].values, ['foo'])
    return df[mask]
Testing
res = pd.DataFrame(
index=[
'mask_standard', 'mask_standard_loc', 'mask_with_values', 'mask_with_values_loc',
'query', 'xs_label', 'mask_with_isin', 'mask_with_in1d'
],
columns=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
dtype=float
)
from timeit import timeit

for j in res.columns:
    d = pd.concat([df] * j, ignore_index=True)
    for i in res.index:
        stmt = '{}(d)'.format(i)
        setp = 'from __main__ import d, {}'.format(i)
        res.at[i, j] = timeit(stmt, setp, number=50)
Special Timing
Looking at the special case when we have a single non-object dtype for the entire data frame.
Code Below
spec.div(spec.min())
10 30 100 300 1000 3000 10000 30000
mask_with_values 1.009030 1.000000 1.194276 1.000000 1.236892 1.095343 1.000000 1.000000
mask_with_in1d 1.104638 1.094524 1.156930 1.072094 1.000000 1.000000 1.040043 1.027100
reconstruct 1.000000 1.142838 1.000000 1.355440 1.650270 2.222181 2.294913 3.406735
Turns out, reconstruction isn't worth it past a few hundred rows.
spec.T.plot(loglog=True)
Functions
np.random.seed([3,1415])
d1 = pd.DataFrame(np.random.randint(10, size=(10, 5)), columns=list('ABCDE'))
def mask_with_values(df):
    mask = df['A'].values == 'foo'
    return df[mask]

def mask_with_in1d(df):
    mask = np.in1d(df['A'].values, ['foo'])
    return df[mask]

def reconstruct(df):
    v = df.values
    mask = np.in1d(df['A'].values, ['foo'])
    return pd.DataFrame(v[mask], df.index[mask], df.columns)
spec = pd.DataFrame(
index=['mask_with_values', 'mask_with_in1d', 'reconstruct'],
columns=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
dtype=float
)
Testing
for j in spec.columns:
    d = pd.concat([df] * j, ignore_index=True)
    for i in spec.index:
        stmt = '{}(d)'.format(i)
        setp = 'from __main__ import d, {}'.format(i)
        spec.at[i, j] = timeit(stmt, setp, number=50)
tl;dr
The Pandas equivalent to
select * from table where column_name = some_value
is
table[table.column_name == some_value]
Multiple conditions:
table[(table.column_name == some_value) | (table.column_name2 == some_value2)]
or
table.query('column_name == some_value | column_name2 == some_value2')
Code example
import pandas as pd
# Create data set
d = {'foo':[100, 111, 222],
'bar':[333, 444, 555]}
df = pd.DataFrame(d)
# Full dataframe:
df
# Shows:
# bar foo
# 0 333 100
# 1 444 111
# 2 555 222
# Output only the row(s) in df where foo is 222:
df[df.foo == 222]
# Shows:
# bar foo
# 2 555 222
In the above code it is the line df[df.foo == 222] that gives the rows based on the column value, 222 in this case.
Multiple conditions are also possible:
df[(df.foo == 222) | (df.bar == 444)]
# bar foo
# 1 444 111
# 2 555 222
But at that point I would recommend using the query function, since it's less verbose and yields the same result:
df.query('foo == 222 | bar == 444')
I find the syntax of the previous answers to be redundant and difficult to remember. Pandas introduced the query() method in v0.13 and I much prefer it. For your question, you could do df.query('col == val').
Reproduced from The query() Method (Experimental):
In [167]: n = 10
In [168]: df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))
In [169]: df
Out[169]:
a b c
0 0.687704 0.582314 0.281645
1 0.250846 0.610021 0.420121
2 0.624328 0.401816 0.932146
3 0.011763 0.022921 0.244186
4 0.590198 0.325680 0.890392
5 0.598892 0.296424 0.007312
6 0.634625 0.803069 0.123872
7 0.924168 0.325076 0.303746
8 0.116822 0.364564 0.454607
9 0.986142 0.751953 0.561512
# pure python
In [170]: df[(df.a < df.b) & (df.b < df.c)]
Out[170]:
a b c
3 0.011763 0.022921 0.244186
8 0.116822 0.364564 0.454607
# query
In [171]: df.query('(a < b) & (b < c)')
Out[171]:
a b c
3 0.011763 0.022921 0.244186
8 0.116822 0.364564 0.454607
You can also access variables in the environment by prepending an @.
exclude = ('red', 'orange')
df.query('color not in @exclude')
More flexibility using .query with pandas >= 0.25.0:
Since pandas >= 0.25.0 we can use the query method to filter dataframes with pandas methods and even column names which have spaces. Normally the spaces in column names would give an error, but now we can solve that using a backtick (`) - see GitHub:
# Example dataframe
df = pd.DataFrame({'Sender email':['ex@example.com', "reply@shop.com", "buy@shop.com"]})
Sender email
0 ex@example.com
1 reply@shop.com
2 buy@shop.com
Using .query with method str.endswith:
df.query('`Sender email`.str.endswith("@shop.com")')
Output
Sender email
1 reply@shop.com
2 buy@shop.com
Also we can use local variables by prefixing them with an @ in our query:
domain = 'shop.com'
df.query('`Sender email`.str.endswith(@domain)')
Output
Sender email
1 reply@shop.com
2 buy@shop.com
For selecting only specific columns out of multiple columns for a given value in Pandas:
select col_name1, col_name2 from table where column_name = some_value.
Options loc:
df.loc[df['column_name'] == some_value, [col_name1, col_name2]]
or query:
df.query('column_name == some_value')[[col_name1, col_name2]]
In newer versions of Pandas, inspired by the documentation (Viewing data):
df[df["column_name"] == some_value]    # scalar, True/False...
df[df["column_name"] == "some_value"]  # string
Combine multiple conditions by putting each clause in parentheses, (), and combining them with & and | (and/or). Like this:
df[(df["column_name"] == "some_value1") & (df["column_name"] == "some_value2")]
Other filters
pandas.notna(df["column_name"]) == True  # Not NaN
df['column_name'].str.contains("text")   # Search for "text"
df['column_name'].str.lower().str.contains("text")  # Search for "text", after converting to lowercase
Faster results can be achieved using numpy.where.
For example, with unutbu's setup:
In [76]: df.iloc[np.where(df.A.values=='foo')]
Out[76]:
A B C D
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
Timing comparisons:
In [68]: %timeit df.iloc[np.where(df.A.values=='foo')] # fastest
1000 loops, best of 3: 380 µs per loop
In [69]: %timeit df.loc[df['A'] == 'foo']
1000 loops, best of 3: 745 µs per loop
In [71]: %timeit df.loc[df['A'].isin(['foo'])]
1000 loops, best of 3: 562 µs per loop
In [72]: %timeit df[df.A=='foo']
1000 loops, best of 3: 796 µs per loop
In [74]: %timeit df.query('(A=="foo")') # slowest
1000 loops, best of 3: 1.71 ms per loop
Here is a simple example
from pandas import DataFrame
# Create data set
d = {'Revenue':[100,111,222],
'Cost':[333,444,555]}
df = DataFrame(d)
# mask = Return True when the value in column "Revenue" is equal to 111
mask = df['Revenue'] == 111
print(mask)
# Result:
# 0 False
# 1 True
# 2 False
# Name: Revenue, dtype: bool
# Select * FROM df WHERE Revenue = 111
df[mask]
# Result:
# Cost Revenue
# 1 444 111
To add: you can also do df.groupby('column_name').get_group('column_desired_value').reset_index() to make a new data frame with the specified column having a particular value. E.g.,
import pandas as pd
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
'B': 'one one two three two two one three'.split()})
print("Original dataframe:")
print(df)
b_is_two_dataframe = pd.DataFrame(df.groupby('B').get_group('two').reset_index()).drop('index', axis = 1)
#NOTE: the final drop is to remove the extra index column returned by groupby object
print('Sub dataframe where B is two:')
print(b_is_two_dataframe)
Running this gives:
Original dataframe:
A B
0 foo one
1 bar one
2 foo two
3 bar three
4 foo two
5 bar two
6 foo one
7 foo three
Sub dataframe where B is two:
A B
0 foo two
1 foo two
2 bar two
You can also use .apply:
df.apply(lambda row: row[df['B'].isin(['one','three'])])
It actually works row-wise (i.e., applies the function to each row).
The output is
A B C D
0 foo one 0 0
1 bar one 1 2
3 bar three 3 6
6 foo one 6 12
7 foo three 7 14
The result is the same as the approach mentioned by @unutbu:
df[df['B'].isin(['one','three'])]
1. Use f-strings inside query() calls
If the column name used to filter your dataframe comes from a local variable, f-strings may be useful. For example,
col = 'A'
df.query(f"{col} == 'foo'")
In fact, f-strings can be used for the query variable as well (except for datetime):
col = 'A'
my_var = 'foo'
df.query(f"{col} == '{my_var}'") # if my_var is a string
my_num = 1
df.query(f"{col} == {my_num}") # if my_var is a number
my_date = '2022-12-10'
df.query(f"{col} == #my_date") # must use # for datetime though
2. Install numexpr to speed up query() calls
The pandas documentation recommends installing numexpr to speed up numeric calculation when using query(). Use pip install numexpr (or conda, sudo etc. depending on your environment) to install it.
For larger dataframes (where performance actually matters), df.query() with numexpr engine performs much faster than df[mask]. In particular, it performs better for the following cases.
Logical and/or comparison operators on columns of strings
If a column of strings is compared to some other string(s) and matching rows are to be selected, even for a single comparison operation, query() performs faster than df[mask]. For example, for a dataframe with 80k rows, it's 30% faster [1] and for a dataframe with 800k rows, it's 60% faster [2].
df[df.A == 'foo']
df.query("A == 'foo'") # <--- performs 30%-60% faster
This gap increases as the number of operations increases (if 4 comparisons are chained, df.query() is 2-2.3 times faster than df[mask]) [1][2] and/or the dataframe length increases [2].
Multiple operations on numeric columns
If multiple arithmetic, logical or comparison operations need to be computed to create a boolean mask to filter df, query() performs faster. For example, for a frame with 80k rows, it's 20% faster [1] and for a frame with 800k rows, it's 2 times faster [2].
df[(df.B % 5) **2 < 0.1]
df.query("(B % 5) **2 < 0.1") # <--- performs 20%-100% faster.
This gap in performance increases as the number of operations increases and/or the dataframe length increases [2].
The following plot shows how the methods perform as the dataframe length increases [3].
3. Call pandas methods inside query()
Numexpr currently supports only logical (&, |, ~), comparison (==, >, <, >=, <=, !=) and basic arithmetic operators (+, -, *, /, **, %).
For example, it doesn't support integer division (//). However, calling the equivalent pandas method (floordiv()) works.
df.query('B.floordiv(2) <= 3') # or
df.query('B.floordiv(2).le(3)')
# for pandas < 1.4, need `.values`
df.query('B.floordiv(2).values <= 3')
[1] Benchmark code using a frame with 80k rows
import numpy as np
df = pd.DataFrame({'A': 'foo bar foo baz foo bar foo foo'.split()*10000,
'B': np.random.rand(80000)})
%timeit df[df.A == 'foo']
# 8.5 ms ± 104.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.query("A == 'foo'")
# 6.36 ms ± 95.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df[((df.A == 'foo') & (df.A != 'bar')) | ((df.A != 'baz') & (df.A != 'buz'))]
# 29 ms ± 554 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit df.query("A == 'foo' & A != 'bar' | A != 'baz' & A != 'buz'")
# 16 ms ± 339 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit df[(df.B % 5) **2 < 0.1]
# 5.35 ms ± 37.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.query("(B % 5) **2 < 0.1")
# 4.37 ms ± 46.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
[2] Benchmark code using a frame with 800k rows
df = pd.DataFrame({'A': 'foo bar foo baz foo bar foo foo'.split()*100000,
'B': np.random.rand(800000)})
%timeit df[df.A == 'foo']
# 87.9 ms ± 873 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit df.query("A == 'foo'")
# 54.4 ms ± 726 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit df[((df.A == 'foo') & (df.A != 'bar')) | ((df.A != 'baz') & (df.A != 'buz'))]
# 310 ms ± 3.4 ms per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit df.query("A == 'foo' & A != 'bar' | A != 'baz' & A != 'buz'")
# 132 ms ± 2.43 ms per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit df[(df.B % 5) **2 < 0.1]
# 54 ms ± 488 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit df.query("(B % 5) **2 < 0.1")
# 26.3 ms ± 320 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
[3] Code used to produce the performance graphs of the two methods for strings and numbers.
from perfplot import plot
constructor = lambda n: pd.DataFrame({'A': 'foo bar foo baz foo bar foo foo'.split()*n, 'B': np.random.rand(8*n)})
plot(
setup=constructor,
kernels=[lambda df: df[(df.B%5)**2<0.1], lambda df: df.query("(B%5)**2<0.1")],
labels= ['df[(df.B % 5) **2 < 0.1]', 'df.query("(B % 5) **2 < 0.1")'],
n_range=[2**k for k in range(4, 24)],
xlabel='Rows in DataFrame',
title='Multiple mathematical operations on numbers',
equality_check=pd.DataFrame.equals);
plot(
setup=constructor,
kernels=[lambda df: df[df.A == 'foo'], lambda df: df.query("A == 'foo'")],
labels= ["df[df.A == 'foo']", """df.query("A == 'foo'")"""],
n_range=[2**k for k in range(4, 24)],
xlabel='Rows in DataFrame',
title='Comparison operation on strings',
equality_check=pd.DataFrame.equals);
If you need to query your dataframe repeatedly and speed is important to you, the best thing is to convert your dataframe to a dictionary; queries can then be thousands of times faster.
my_df = df.set_index(column_name)
my_dict = my_df.to_dict('index')
After make my_dict dictionary you can go through:
if some_value in my_dict.keys():
    my_result = my_dict[some_value]
If you have duplicated values in column_name you can't make a dictionary, but you can use:
my_result = my_df.loc[some_value]
SQL statements on DataFrames to select rows using DuckDB
With DuckDB we can query pandas DataFrames with SQL statements, in a highly performant way.
Since the question is "How do I select rows from a DataFrame based on column values?", and the example in the question is a SQL query, this answer fits logically in this topic.
Example:
In [1]: import duckdb
In [2]: import pandas as pd
In [3]: con = duckdb.connect()
In [4]: df = pd.DataFrame({"A": range(11), "B": range(11, 22)})
In [5]: df
Out[5]:
A B
0 0 11
1 1 12
2 2 13
3 3 14
4 4 15
5 5 16
6 6 17
7 7 18
8 8 19
9 9 20
10 10 21
In [6]: results = con.execute("SELECT * FROM df where A > 2").df()
In [7]: results
Out[7]:
A B
0 3 14
1 4 15
2 5 16
3 6 17
4 7 18
5 8 19
6 9 20
7 10 21
You can use loc (square brackets) with a function:
# Series
s = pd.Series([1, 2, 3, 4])
s.loc[lambda x: x > 1]
# s[lambda x: x > 1]
Output:
1 2
2 3
3 4
dtype: int64
or
# DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})
df.loc[lambda x: x['A'] > 1]
# df[lambda x: x['A'] > 1]
Output:
A B
1 2 20
2 3 30
Great answers. Only, when the size of the dataframe approaches a million rows, many of the methods tend to take ages when using df[df['col']==val]. I wanted to have all possible values of "another_column" that correspond to specific values in "some_column" (in this case in a dictionary). This worked, and fast.
import datetime

s = datetime.datetime.now()
my_dict = {}
for i, my_key in enumerate(df['some_column'].values):
    if i % 100 == 0:
        print(i)  # to see the progress
    if my_key not in my_dict.keys():
        my_dict[my_key] = {}
        my_dict[my_key]['values'] = [df.iloc[i]['another_column']]
    else:
        my_dict[my_key]['values'].append(df.iloc[i]['another_column'])
e = datetime.datetime.now()
print('operation took ' + str(e - s) + ' seconds')
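A hedged vectorized alternative (column names follow the snippet above) that builds an equivalent mapping without a row-by-row Python loop, though without the nested 'values' key:
my_dict = df.groupby('some_column')['another_column'].agg(list).to_dict()
# {key: [another_column values for rows where some_column == key], ...}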
