Error : 'Series' object has no attribute 'sort' [duplicate] - python-3.x

I have a problem: numpy is installed in my Python environment, but I still get this error:
'DataFrame' object has no attribute 'sort'
Can anyone give me some ideas?
This is my code:
final.loc[-1] =['', 'P','Actual']
final.index = final.index + 1 # shifting index
final = final.sort()
final = final.iloc[1:].reset_index(drop=True)
final.columns.names = (None, None)

sort() was deprecated for DataFrames in favor of either:
sort_values() to sort by column(s)
sort_index() to sort by the index
sort() was deprecated (but still available) in Pandas with release 0.17 (2015-10-09) with the introduction of sort_values() and sort_index(). It was removed from Pandas with release 0.20 (2017-05-05).
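Applied to the code in the question, the fix is to call sort_index() where sort() used to be. The sketch below uses a small made-up frame in place of the asker's final, since its real contents are not shown; only the three middle lines mirror the question.

```python
import pandas as pd

# Hypothetical stand-in for the question's `final` frame (contents are assumed).
final = pd.DataFrame({'a': ['x', 'y'], 'b': ['q', 'r'], 'c': ['m', 'n']})

final.loc[-1] = ['', 'P', 'Actual']   # append a row labelled -1
final.index = final.index + 1         # shift labels so the new row becomes 0
final = final.sort_index()            # replaces the removed final.sort()

print(final)
```

After sorting by index, the appended row sits at the top of the frame.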

Pandas Sorting 101
sort has been replaced in v0.20 by DataFrame.sort_values and DataFrame.sort_index. Aside from this, we also have argsort.
Here are some common use cases in sorting, and how to solve them using the sorting functions in the current API. First, the setup.
# Setup
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': list('accab'), 'B': np.random.choice(10, 5)})
0 a 7
1 c 9
2 c 3
3 a 5
4 b 2
Sort by Single Column
For example, to sort df by column "A", use sort_values with a single column name:
df.sort_values(by='A')
0 a 7
3 a 5
4 b 2
1 c 9
2 c 3
If you need a fresh RangeIndex, use DataFrame.reset_index.
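A minimal sketch of sorting and then renumbering the rows. The values in column B are fixed here (an assumption, replacing the random setup) so the result is reproducible, and kind='mergesort' is chosen only to make the order of tied rows deterministic.

```python
import pandas as pd

# Fixed values standing in for the random column B in the setup above.
df = pd.DataFrame({'A': list('accab'), 'B': [7, 9, 3, 5, 2]})

# mergesort is stable, so rows with equal A keep their original relative order.
by_a = df.sort_values(by='A', kind='mergesort')

# Discard the old labels and number the rows 0..n-1 again.
by_a_reset = by_a.reset_index(drop=True)

print(by_a)
print(by_a_reset)
```

Passing drop=True keeps the old index from being added back as a column.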
Sort by Multiple Columns
For example, to sort by both col "A" and "B" in df, you can pass a list to sort_values:
df.sort_values(by=['A', 'B'])
3 a 5
0 a 7
4 b 2
2 c 3
1 c 9
Sort By DataFrame Index
df2 = df.sample(frac=1)
1 c 9
0 a 7
2 c 3
3 a 5
4 b 2
You can do this using sort_index:
df2.sort_index()
0 a 7
1 c 9
2 c 3
3 a 5
4 b 2
# False
# True
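The False and True results above presumably come from comparing the shuffled and re-sorted frames with the original; that reading is an assumption, since the calls themselves did not survive. A sketch with a fixed permutation instead of df.sample, so the outcome is deterministic:

```python
import pandas as pd

df = pd.DataFrame({'A': list('accab'), 'B': [7, 9, 3, 5, 2]})

# A fixed reordering standing in for df.sample(frac=1).
df2 = df.iloc[[1, 0, 2, 3, 4]]

restored = df2.sort_index()

shuffled_equal = df.equals(df2)        # row order differs
restored_equal = df.equals(restored)   # same order again after sort_index

print(shuffled_equal, restored_equal)
```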
Here are some comparable methods with their performance:
%timeit df2.sort_index()
%timeit df2.iloc[df2.index.argsort()]
%timeit df2.reindex(np.sort(df2.index))
605 µs ± 13.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
610 µs ± 24.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
581 µs ± 7.63 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Sort by List of Indices
For example,
idx = df2.index.argsort()
# array([0, 7, 2, 3, 9, 4, 5, 6, 8, 1])
This "sorting" problem is actually a simple indexing problem. Just passing integer labels to iloc will do.
1 c 9
0 a 7
2 c 3
3 a 5
4 b 2


How to extract an arithmetic operation from a string with Pandas

In a Pandas DataFrame
>> df.head()
0 1 â#0.00 + "s=?0.07 + 'due0.93 rt#-[ 3.01
1 2 â#0.02 + "s=?0.16 + 'due0.82 rt#-[ 2.97
I would like to extract only the numeric values. Column C I can do with, e.g.,
>> extr = df['C'].str.extract(r'(\d+\.\d+)', expand=False)
>> df['C'] = pd.to_numeric(extr)
>> df.head()
0 1 â#0.00 + "s=?0.07 + 'due0.93 3.01
1 2 â#0.02 + "s=?0.16 + 'due0.82 2.97
but I have problems with the B column. How can I extract the + operations, as well as the floats? I tried
>> extr = df['B'].str.extract(r'(\d+\.\d+)\+(\d+\.\d+)\+(\d+\.\d+)', expand=False)
which I was hoping would give me something like
0 '0.00+0.07+0.93'
1 '0.02+0.16+0.82'
but instead it gives me three columns with NaN values in them:
0 1 2
0 NaN NaN NaN
1 NaN NaN NaN
So how could I extract the whole arithmetic operations?
(Only the + operations are needed, and any other characters, such as -, can be ignored.)
An alternate approach using Series.str.findall:
df['B'] = df['B'].str.findall(r'(\d+(?:\.\d+)?)').agg('+'.join)
# print(df)
0 1 0.00+0.07+0.93 3.01
1 2 0.02+0.16+0.82 2.97
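The pattern tried in the question most likely returns NaN because it requires a + immediately between the numbers, while the real strings have spaces and other characters in between. Collecting every float with findall avoids describing the separators at all. A self-contained sketch, with simplified ASCII strings as an assumed stand-in for column B:

```python
import pandas as pd

# Simplified stand-in for the noisy strings in column B.
df = pd.DataFrame({
    'B': ['x#0.00 + "s=?0.07 + \'due0.93',
          'x#0.02 + "s=?0.16 + \'due0.82'],
})

# Every float in each string, as a list of strings per row.
numbers = df['B'].str.findall(r'\d+\.\d+')

# Rebuild the arithmetic expression as text ...
df['expr'] = numbers.str.join('+')

# ... or evaluate it directly without eval().
df['total'] = numbers.apply(lambda parts: sum(float(p) for p in parts))

print(df[['expr', 'total']])
```

Summing the parsed floats sidesteps evaluating the extracted text as code.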
timeit comparison of all the solutions:
(20000, 4)
%%timeit -n100 #Shubham solution
31.9 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit -n100 #Rakesh solution
32.7 ms ± 1.71 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit -n100 #Sammy solution
["+".join(re.findall("(\d+\.?\d+)",entry)) for entry in df.B]
36.8 ms ± 431 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit -n100 #JudV solution
df['B'].str.replace(r'[^\d.+]', '')
59.7 ms ± 5.81 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
One way is to run a str join on the extracted data, using + as the delimiter
import re
import pandas as pd
df = pd.read_clipboard(sep=r'\s{2,}')
df['extract'] = ["+".join(re.findall(r"(\d+\.?\d+)", entry)) for entry in df.B]
A B C extract
0 1 â#0.00 + "s=?0.07 + 'due0.93 3.01 0.00+0.07+0.93
1 2 â#0.02 + "s=?0.16 + 'due0.82 2.97 0.02+0.16+0.82
This is one approach using str.findall & .str.join("+")
df = pd.DataFrame({"B": ["""â#0.00 + "s=?0.07 + 'due0.93""", """â#0.02 + "s=?0.16 + 'due0.82"""]})
df["Z"] = df["B"].str.findall(r"(\d+\.\d+)").str.join("+")
0 â#0.00 + "s=?0.07 + 'due0.93 0.00+0.07+0.93
1 â#0.02 + "s=?0.16 + 'due0.82 0.02+0.16+0.82
Python is not my forte but I'd use replace instead and do the operation for both columns, maybe look into:
df[['B', 'C']] = df[['B','C']].replace(r'[^\d.+]', '', regex=True)
0 1 0.00+0.07+0.93 3.01
1 2 0.02+0.16+0.82 2.97
If it's just column B you are after, then maybe simply use:
extr = df['B'].str.replace(r'[^\d.+]', '', regex=True)

Pythonic way for calculating complex terms in Pandas (values bigger or equal to a number divided by the length of a list)

I have the following dataframe:
simple_list=[[3.0, [1.1, 2.2, 3.3, 4.4, 5.5]]]
simple_list.append([0.25, [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]])
I want to create a new column called new_col, holding the following calculation:
The number of elements in col2 that are greater than or equal to the number in col1, divided by the length of the list in col2.
The first value in new_col will be 0.6 (3 numbers are bigger than 3.0, and the list has length 5).
The second value in new_col will be 0.6667 (4 numbers are bigger than 0.25, and the list has length 6).
Use DataFrame.explode with DataFrame.eval to compare the columns, then take the mean per index value:
df4['new'] = df4.explode('col2').eval('col1 < col2').groupby(level=0).mean()  # .mean(level=0) was removed in pandas 2.0
Or convert the lists to a DataFrame and, before taking the mean, restore the missing values from df1:
df1 = pd.DataFrame(df4['col2'].tolist(), index=df4.index)
df4['new'] = df1.gt(df4['col1'], axis=0).mask(df1.isna()).mean(axis=1)
Slower solutions:
A list comprehension that converts each list to a NumPy array:
df4['new'] = [(np.array(b) > a).mean() for a, b in df4[['col1','col2']].to_numpy()]
Another idea with DataFrame.apply:
df4['new'] = df4.apply(lambda x: (np.array(x['col2']) > x['col1']).mean(), axis=1)
print (df4)
col1 col2 new
0 3.00 [1.1, 2.2, 3.3, 4.4, 5.5] 0.600000
1 0.25 [0.1, 0.2, 0.3, 0.4, 0.5, 0.6] 0.666667
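A self-contained sketch of the explode approach on the question's data. The frame construction and the column names col1/col2 are inferred from the printed output, and groupby(level=0).mean() is used instead of mean(level=0), which recent pandas versions no longer accept.

```python
import pandas as pd

simple_list = [[3.0, [1.1, 2.2, 3.3, 4.4, 5.5]],
               [0.25, [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]]]
df4 = pd.DataFrame(simple_list, columns=['col1', 'col2'])

# One row per list element; the original row label is repeated.
exploded = df4.explode('col2')

# True where the element exceeds the threshold in col1.
bigger = exploded['col2'].astype(float) > exploded['col1']

# The mean of booleans per original row is the requested fraction.
df4['new'] = bigger.groupby(level=0).mean()

print(df4)
```

Use >= instead of > if ties should count, as the wording of the question suggests; for this data both give the same result.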
df4 = pd.concat([df4] * 10000, ignore_index=True)
In [262]: %%timeit
...: df1 = pd.DataFrame(df4['col2'].tolist(), index=df4.index)
...: df4['new'] = df1.gt(df4['col1'], axis=0).mask(df1.isna()).mean(axis=1)
40.9 ms ± 3.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [263]: %timeit df4.explode('col2').eval('col1 < col2').mean(level=0)
97.2 ms ± 13.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [264]: %timeit [(np.array(b) > a).mean() for a, b in df4[['col1','col2']].to_numpy()]
305 ms ± 12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [265]: %timeit df4.apply(lambda x: (np.array(x['col2']) > x['col1']).mean(), axis=1)
1.23 s ± 32.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Filter the DataFrame rows to only show movies with a 'duration' of at least 200 minutes [duplicate]

How can I select rows from a DataFrame based on values in some column in Pandas?
In SQL, I would use:
SELECT *
FROM table
WHERE column_name = some_value
To select rows whose column value equals a scalar, some_value, use ==:
df.loc[df['column_name'] == some_value]
To select rows whose column value is in an iterable, some_values, use isin:
df.loc[df['column_name'].isin(some_values)]
Combine multiple conditions with &:
df.loc[(df['column_name'] >= A) & (df['column_name'] <= B)]
Note the parentheses. Due to Python's operator precedence rules, & binds more tightly than <= and >=. Thus, the parentheses in the last example are necessary. Without the parentheses
df['column_name'] >= A & df['column_name'] <= B
is parsed as
df['column_name'] >= (A & df['column_name']) <= B
which results in a Truth value of a Series is ambiguous error.
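The precedence pitfall can be reproduced in a few lines. This sketch uses a small integer column and the bounds A and B from the text; the frame itself is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'C': np.arange(8)})
A, B = 2, 5

# Parenthesised: each comparison is evaluated first, then combined with &.
in_range = df.loc[(df['C'] >= A) & (df['C'] <= B)]

# Unparenthesised: & binds first, leaving a chained comparison that has to
# reduce a whole Series to a single bool, which pandas refuses to do.
try:
    df['C'] >= A & df['C'] <= B
    raised = False
except ValueError:
    raised = True

print(in_range)
print(raised)
```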
To select rows whose column value does not equal some_value, use !=:
df.loc[df['column_name'] != some_value]
isin returns a boolean Series, so to select rows whose value is not in some_values, negate the boolean Series using ~:
df.loc[~df['column_name'].isin(some_values)]
For example,
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
'B': 'one one two three two two one three'.split(),
'C': np.arange(8), 'D': np.arange(8) * 2})
# A B C D
# 0 foo one 0 0
# 1 bar one 1 2
# 2 foo two 2 4
# 3 bar three 3 6
# 4 foo two 4 8
# 5 bar two 5 10
# 6 foo one 6 12
# 7 foo three 7 14
print(df.loc[df['A'] == 'foo'])
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
If you have multiple values you want to include, put them in a
list (or more generally, any iterable) and use isin:
print(df.loc[df['B'].isin(['one','three'])])
0 foo one 0 0
1 bar one 1 2
3 bar three 3 6
6 foo one 6 12
7 foo three 7 14
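For completeness, a sketch of the same isin selection next to its negation with ~, on the example frame from this answer. The variable names wanted, in_wanted and not_in_wanted are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})

wanted = ['one', 'three']

# Rows whose B value is one of the wanted values.
in_wanted = df.loc[df['B'].isin(wanted)]

# ~ flips the boolean mask, keeping every other row.
not_in_wanted = df.loc[~df['B'].isin(wanted)]

print(in_wanted)
print(not_in_wanted)
```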
Note, however, that if you wish to do this many times, it is more efficient to
make an index first, and then use df.loc:
df = df.set_index(['B'])
print(df.loc['one'])
one foo 0 0
one bar 1 2
one foo 6 12
or, to include multiple values from the index use df.index.isin:
df.loc[df.index.isin(['one','two'])]
one foo 0 0
one bar 1 2
two foo 2 4
two foo 4 8
two bar 5 10
one foo 6 12
There are several ways to select rows from a Pandas dataframe:
Boolean indexing (df[df['col'] == value] )
Positional indexing (df.iloc[...])
Label indexing (df.xs(...))
df.query(...) API
Below I show you examples of each, with advice when to use certain techniques. Assume our criterion is column 'A' == 'foo'
(Note on performance: For each base type, we can keep things simple by using the Pandas API or we can venture outside the API, usually into NumPy, and speed things up.)
The first thing we'll need is to identify a condition that will act as our criterion for selecting rows. We'll start with the OP's case column_name == some_value, and include some other common use cases.
Borrowing from @unutbu:
import pandas as pd, numpy as np
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
'B': 'one one two three two two one three'.split(),
'C': np.arange(8), 'D': np.arange(8) * 2})
1. Boolean indexing
Boolean indexing requires finding the truth value of each row's 'A' column being equal to 'foo', then using those truth values to identify which rows to keep. Typically, we'd name this series, an array of truth values, mask. We'll do so here as well.
mask = df['A'] == 'foo'
We can then use this mask to slice or index the data frame:
df[mask]
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
This is one of the simplest ways to accomplish this task and if performance or intuitiveness isn't an issue, this should be your chosen method. However, if performance is a concern, then you might want to consider an alternative way of creating the mask.
2. Positional indexing
Positional indexing (df.iloc[...]) has its use cases, but this isn't one of them. In order to identify where to slice, we first need to perform the same boolean analysis we did above. This leaves us performing one extra step to accomplish the same task.
mask = df['A'] == 'foo'
pos = np.flatnonzero(mask)
df.iloc[pos]
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
3. Label indexing
Label indexing can be very handy, but in this case, we are again doing more work for no benefit
df.set_index('A', append=True, drop=False).xs('foo', level=1)
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
4. df.query() API
pd.DataFrame.query is a very elegant/intuitive way to perform this task, but is often slower. However, if you pay attention to the timings below, for large data, the query is very efficient. More so than the standard approach and of similar magnitude as my best suggestion.
df.query('A == "foo"')
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
My preference is to use the Boolean mask
Actual improvements can be made by modifying how we create our Boolean mask.
mask alternative 1
Use the underlying NumPy array and forgo the overhead of creating another pd.Series
mask = df['A'].values == 'foo'
I'll show more complete time tests at the end, but just take a look at the performance gains we get using the sample data frame. First, we look at the difference in creating the mask
%timeit mask = df['A'].values == 'foo'
%timeit mask = df['A'] == 'foo'
5.84 µs ± 195 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
166 µs ± 4.45 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Evaluating the mask with the NumPy array is ~ 30 times faster. This is partly due to NumPy evaluation often being faster. It is also partly due to the lack of overhead necessary to build an index and a corresponding pd.Series object.
Next, we'll look at the timing for slicing with one mask versus the other.
mask = df['A'].values == 'foo'
%timeit df[mask]
mask = df['A'] == 'foo'
%timeit df[mask]
219 µs ± 12.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
239 µs ± 7.03 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The performance gains aren't as pronounced. We'll see if this holds up over more robust testing.
mask alternative 2
We could have reconstructed the data frame as well. There is a big caveat when reconstructing a dataframe—you must take care of the dtypes when doing so!
Instead of df[mask] we will do this
pd.DataFrame(df.values[mask], df.index[mask], df.columns).astype(df.dtypes)
If the data frame is of mixed type, which our example is, then when we get df.values the resulting array is of dtype object and consequently, all columns of the new data frame will be of dtype object. Thus requiring the astype(df.dtypes) and killing any potential performance gains.
%timeit df[mask]
%timeit pd.DataFrame(df.values[mask], df.index[mask], df.columns).astype(df.dtypes)
216 µs ± 10.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.43 ms ± 39.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
However, if the data frame is not of mixed type, this is a very useful way to do it.
d1 = pd.DataFrame(np.random.randint(10, size=(10, 5)), columns=list('ABCDE'))
0 0 2 7 3 8
1 7 0 6 8 6
2 0 2 0 4 9
3 7 3 2 4 3
4 3 6 7 7 4
5 5 3 7 5 9
6 8 7 6 4 7
7 6 2 6 6 5
8 2 8 7 5 8
9 4 7 6 1 5
mask = d1['A'].values == 7
%timeit d1[mask]
179 µs ± 8.73 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
mask = d1['A'].values == 7
%timeit pd.DataFrame(d1.values[mask], d1.index[mask], d1.columns)
87 µs ± 5.12 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
We cut the time in half.
mask alternative 3
@unutbu also shows us how to use pd.Series.isin to account for each element of df['A'] being in a set of values. This evaluates to the same thing if our set of values is a set of one value, namely 'foo'. But it also generalizes to include larger sets of values if needed. Turns out, this is still pretty fast even though it is a more general solution. The only real loss is in intuitiveness for those not familiar with the concept.
mask = df['A'].isin(['foo'])
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
However, as before, we can utilize NumPy to improve performance while sacrificing virtually nothing. We'll use np.in1d
mask = np.in1d(df['A'].values, ['foo'])
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
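Note that np.in1d is deprecated as of NumPy 2.0 in favor of np.isin. The sketch below swaps in np.isin for the same mask; it is not part of the original timings, so the relative performance figures in this answer should not be assumed to carry over.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})

# pandas-level membership test, returns a boolean Series.
mask_series = df['A'].isin(['foo'])

# NumPy-level membership test on the underlying array, returns an ndarray.
mask_array = np.isin(df['A'].to_numpy(), ['foo'])

print(df[mask_array])
```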
I'll include other concepts mentioned in other posts as well for reference.
Code Below
Each column in this table represents a different length data frame over which we test each function. Each column shows relative time taken, with the fastest function given a base index of 1.0.
10 30 100 300 1000 3000 10000 30000
mask_standard 2.156872 1.850663 2.034149 2.166312 2.164541 3.090372 2.981326 3.131151
mask_standard_loc 1.879035 1.782366 1.988823 2.338112 2.361391 3.036131 2.998112 2.990103
mask_with_values 1.010166 1.000000 1.005113 1.026363 1.028698 1.293741 1.007824 1.016919
mask_with_values_loc 1.196843 1.300228 1.000000 1.000000 1.038989 1.219233 1.037020 1.000000
query 4.997304 4.765554 5.934096 4.500559 2.997924 2.397013 1.680447 1.398190
xs_label 4.124597 4.272363 5.596152 4.295331 4.676591 5.710680 6.032809 8.950255
mask_with_isin 1.674055 1.679935 1.847972 1.724183 1.345111 1.405231 1.253554 1.264760
mask_with_in1d 1.000000 1.083807 1.220493 1.101929 1.000000 1.000000 1.000000 1.144175
You'll notice that the fastest times seem to be shared between mask_with_values and mask_with_in1d.
from timeit import timeit

def mask_standard(df):
    mask = df['A'] == 'foo'
    return df[mask]

def mask_standard_loc(df):
    mask = df['A'] == 'foo'
    return df.loc[mask]

def mask_with_values(df):
    mask = df['A'].values == 'foo'
    return df[mask]

def mask_with_values_loc(df):
    mask = df['A'].values == 'foo'
    return df.loc[mask]

def query(df):
    return df.query('A == "foo"')

def xs_label(df):
    return df.set_index('A', append=True, drop=False).xs('foo', level=-1)

def mask_with_isin(df):
    mask = df['A'].isin(['foo'])
    return df[mask]

def mask_with_in1d(df):
    mask = np.in1d(df['A'].values, ['foo'])
    return df[mask]

# One row per function, one column per data frame length.
res = pd.DataFrame(
    index=[
        'mask_standard', 'mask_standard_loc', 'mask_with_values', 'mask_with_values_loc',
        'query', 'xs_label', 'mask_with_isin', 'mask_with_in1d'
    ],
    columns=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
    dtype=float
)

for j in res.columns:
    d = pd.concat([df] * j, ignore_index=True)
    for i in res.index:
        stmt = '{}(d)'.format(i)
        setp = 'from __main__ import d, {}'.format(i)
        res.at[i, j] = timeit(stmt, setp, number=50)
Special Timing
Looking at the special case when we have a single non-object dtype for the entire data frame.
Code Below
10 30 100 300 1000 3000 10000 30000
mask_with_values 1.009030 1.000000 1.194276 1.000000 1.236892 1.095343 1.000000 1.000000
mask_with_in1d 1.104638 1.094524 1.156930 1.072094 1.000000 1.000000 1.040043 1.027100
reconstruct 1.000000 1.142838 1.000000 1.355440 1.650270 2.222181 2.294913 3.406735
Turns out, reconstruction isn't worth it past a few hundred rows.
d1 = pd.DataFrame(np.random.randint(10, size=(10, 5)), columns=list('ABCDE'))

def mask_with_values(df):
    mask = df['A'].values == 'foo'
    return df[mask]

def mask_with_in1d(df):
    mask = np.in1d(df['A'].values, ['foo'])
    return df[mask]

def reconstruct(df):
    v = df.values
    mask = np.in1d(df['A'].values, ['foo'])
    return pd.DataFrame(v[mask], df.index[mask], df.columns)

spec = pd.DataFrame(
    index=['mask_with_values', 'mask_with_in1d', 'reconstruct'],
    columns=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
    dtype=float
)

for j in spec.columns:
    d = pd.concat([df] * j, ignore_index=True)
    for i in spec.index:
        stmt = '{}(d)'.format(i)
        setp = 'from __main__ import d, {}'.format(i)
        spec.at[i, j] = timeit(stmt, setp, number=50)
The Pandas equivalent to
select * from table where column_name = some_value
is
table[table.column_name == some_value]
Multiple conditions:
table[(table.column_name == some_value) | (table.column_name2 == some_value2)]
table.query('column_name == some_value | column_name2 == some_value2')
Code example
import pandas as pd
# Create data set
d = {'foo':[100, 111, 222],
'bar':[333, 444, 555]}
df = pd.DataFrame(d)
# Full dataframe:
df
# Shows:
# bar foo
# 0 333 100
# 1 444 111
# 2 555 222
# Output only the row(s) in df where foo is 222:
df[df.foo == 222]
# Shows:
# bar foo
# 2 555 222
In the above code it is the line df[df.foo == 222] that gives the rows based on the column value, 222 in this case.
Multiple conditions are also possible:
df[(df.foo == 222) | (df.bar == 444)]
# bar foo
# 1 444 111
# 2 555 222
But at that point I would recommend using the query function, since it's less verbose and yields the same result:
df.query('foo == 222 | bar == 444')
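A runnable sketch confirming that the boolean-mask form and the query form select the same rows, using the foo/bar data from this answer. The names by_mask and by_query are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'foo': [100, 111, 222],
                   'bar': [333, 444, 555]})

# Boolean mask: each comparison needs its own parentheses.
by_mask = df[(df.foo == 222) | (df.bar == 444)]

# query(): the condition is a string and needs no parentheses here.
by_query = df.query('foo == 222 | bar == 444')

print(by_query)
```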
I find the syntax of the previous answers to be redundant and difficult to remember. Pandas introduced the query() method in v0.13 and I much prefer it. For your question, you could do df.query('col == val').
Reproduced from The query() Method (Experimental):
In [167]: n = 10
In [168]: df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))
In [169]: df
a b c
0 0.687704 0.582314 0.281645
1 0.250846 0.610021 0.420121
2 0.624328 0.401816 0.932146
3 0.011763 0.022921 0.244186
4 0.590198 0.325680 0.890392
5 0.598892 0.296424 0.007312
6 0.634625 0.803069 0.123872
7 0.924168 0.325076 0.303746
8 0.116822 0.364564 0.454607
9 0.986142 0.751953 0.561512
# pure python
In [170]: df[(df.a < df.b) & (df.b < df.c)]
a b c
3 0.011763 0.022921 0.244186
8 0.116822 0.364564 0.454607
# query
In [171]: df.query('(a < b) & (b < c)')
a b c
3 0.011763 0.022921 0.244186
8 0.116822 0.364564 0.454607
You can also access variables in the environment by prepending an @.
exclude = ('red', 'orange')
df.query('color not in @exclude')
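A small self-contained sketch of referencing a local variable from inside query(). The frame and its color column are made up for illustration, and a list is used for the excluded values.

```python
import pandas as pd

# Hypothetical data; only the column name `color` comes from the text above.
df = pd.DataFrame({'color': ['red', 'green', 'orange', 'blue'],
                   'n': [1, 2, 3, 4]})

exclude = ['red', 'orange']

# @exclude refers to the Python variable, not to a column named exclude.
kept = df.query('color not in @exclude')

print(kept)
```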
More flexibility using .query with pandas >= 0.25.0:
Since pandas >= 0.25.0 we can use the query method to filter dataframes with pandas methods and even column names which have spaces. Normally the spaces in column names would give an error, but now we can solve that using a backtick (`) - see GitHub:
# Example dataframe
df = pd.DataFrame({'Sender email':['', "", ""]})
Sender email
Using .query with method str.endswith:
df.query('`Sender email`.str.endswith("")')
Sender email
Also we can use local variables by prefixing them with an @ in our query:
domain = ''
df.query('`Sender email`.str.endswith(@domain)')
Sender email
For selecting only specific columns out of multiple columns for a given value in Pandas:
select col_name1, col_name2 from table where column_name = some_value.
Options loc:
df.loc[df['column_name'] == some_value, [col_name1, col_name2]]
or query:
df.query('column_name == some_value')[[col_name1, col_name2]]
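Both options can be checked against each other on the example frame used elsewhere in this thread. A sketch, where the chosen columns B and C are arbitrary:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})

# loc takes the row condition first and the list of columns second.
subset = df.loc[df['A'] == 'foo', ['B', 'C']]

# query filters the rows; the columns are selected afterwards.
same = df.query("A == 'foo'")[['B', 'C']]

print(subset)
```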
In newer versions of Pandas, inspired by the documentation (Viewing data):
df[df["column_name"] == some_value] #Scalar, True/False..
df[df["column_name"] == "some_value"] #String
Combine multiple conditions by putting each clause in parentheses, (), and combining them with & and | (and/or). Like this:
df[(df["column_name"] == "some_value1") & (df["column_name"] == "some_value2")]
Other filters
pandas.notna(df["column_name"]) == True # Not NaN
df['column_name'].str.contains("text") # Search for "text"
df['column_name'].str.lower().str.contains("text") # Search for "text", after converting to lowercase
Faster results can be achieved using numpy.where.
For example, with unutbu's setup -
In [76]: df.iloc[np.where(df.A.values=='foo')]
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
Timing comparisons:
In [68]: %timeit df.iloc[np.where(df.A.values=='foo')] # fastest
1000 loops, best of 3: 380 µs per loop
In [69]: %timeit df.loc[df['A'] == 'foo']
1000 loops, best of 3: 745 µs per loop
In [71]: %timeit df.loc[df['A'].isin(['foo'])]
1000 loops, best of 3: 562 µs per loop
In [72]: %timeit df[df.A=='foo']
1000 loops, best of 3: 796 µs per loop
In [74]: %timeit df.query('(A=="foo")') # slowest
1000 loops, best of 3: 1.71 ms per loop
Here is a simple example
from pandas import DataFrame
# Create data set
d = {'Revenue':[100,111,222],
df = DataFrame(d)
# mask = Return True when the value in column "Revenue" is equal to 111
mask = df['Revenue'] == 111
print(mask)
# Result:
# 0 False
# 1 True
# 2 False
# Name: Revenue, dtype: bool
# Select * FROM df WHERE Revenue = 111
df[mask]
# Result:
# Cost Revenue
# 1 444 111
To add: You can also do df.groupby('column_name').get_group('column_desired_value').reset_index() to make a new data frame with specified column having a particular value. E.g.,
import pandas as pd
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
'B': 'one one two three two two one three'.split()})
print("Original dataframe:")
b_is_two_dataframe = pd.DataFrame(df.groupby('B').get_group('two').reset_index()).drop('index', axis = 1)
#NOTE: the final drop is to remove the extra index column returned by groupby object
print('Sub dataframe where B is two:')
Running this gives:
Original dataframe:
0 foo one
1 bar one
2 foo two
3 bar three
4 foo two
5 bar two
6 foo one
7 foo three
Sub dataframe where B is two:
0 foo two
1 foo two
2 bar two
You can also use .apply:
df.apply(lambda row: row[df['B'].isin(['one','three'])])
Note that with the default axis=0, apply passes each column to the function, and the boolean mask then selects the matching rows within each column.
The output is
0 foo one 0 0
1 bar one 1 2
3 bar three 3 6
6 foo one 6 12
7 foo three 7 14
The result is the same as using isin with df.loc, as mentioned by @unutbu.
1. Use f-strings inside query() calls
If the column name used to filter your dataframe comes from a local variable, f-strings may be useful. For example,
col = 'A'
df.query(f"{col} == 'foo'")
In fact, f-strings can be used for the query variable as well (except for datetime):
col = 'A'
my_var = 'foo'
df.query(f"{col} == '{my_var}'") # if my_var is a string
my_num = 1
df.query(f"{col} == {my_num}") # if my_num is a number
my_date = '2022-12-10'
df.query(f"{col} == @my_date") # must use @ for datetime though
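A runnable sketch of the two ways to pass a value into query() when the column name comes from a variable: interpolating it with an f-string, or referencing the variable with @. The variable name wanted is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'C': np.arange(8)})

col = 'A'
wanted = 'foo'

# The f-string builds the query text "A == 'foo'" before pandas sees it.
by_fstring = df.query(f"{col} == '{wanted}'")

# Here only the column name is interpolated; @wanted is resolved by pandas.
by_at = df.query(f"{col} == @wanted")

print(by_fstring)
```

The @ form avoids quoting problems when the value itself contains quote characters.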
2. Install numexpr to speed up query() calls
The pandas documentation recommends installing numexpr to speed up numeric calculation when using query(). Use pip install numexpr (or conda, sudo etc. depending on your environment) to install it.
For larger dataframes (where performance actually matters), df.query() with numexpr engine performs much faster than df[mask]. In particular, it performs better for the following cases.
Logical and/or comparison operators on columns of strings
If a column of strings is compared to some other string(s) and matching rows are to be selected, even for a single comparison operation, query() performs faster than df[mask]. For example, for a dataframe with 80k rows, it's 30% faster [1] and for a dataframe with 800k rows, it's 60% faster [2].
df[df.A == 'foo']
df.query("A == 'foo'") # <--- performs 30%-60% faster
This gap increases as the number of operations increases (if 4 comparisons are chained df.query() is 2-2.3 times faster than df[mask]) [1][2] and/or the dataframe length increases [2].
Multiple operations on numeric columns
If multiple arithmetic, logical or comparison operations need to be computed to create a boolean mask to filter df, query() performs faster. For example, for a frame with 80k rows, it's 20% faster [1] and for a frame with 800k rows, it's 2 times faster [2].
df[(df.B % 5) **2 < 0.1]
df.query("(B % 5) **2 < 0.1") # <--- performs 20%-100% faster.
This gap in performance increases as the number of operations increases and/or the dataframe length increases [2].
The following plot shows how the methods perform as the dataframe length increases [3].
3. Call pandas methods inside query()
Numexpr currently supports only logical (&, |, ~), comparison (==, >, <, >=, <=, !=) and basic arithmetic operators (+, -, *, /, **, %).
For example, it doesn't support integer division (//). However, calling the equivalent pandas method (floordiv()) works.
df.query('B.floordiv(2) <= 3') # or
# for pandas < 1.4, need `.values`
df.query('B.floordiv(2).values <= 3')
1 Benchmark code using a frame with 80k rows
import numpy as np
df = pd.DataFrame({'A': 'foo bar foo baz foo bar foo foo'.split()*10000,
'B': np.random.rand(80000)})
%timeit df[df.A == 'foo']
# 8.5 ms ± 104.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.query("A == 'foo'")
# 6.36 ms ± 95.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df[((df.A == 'foo') & (df.A != 'bar')) | ((df.A != 'baz') & (df.A != 'buz'))]
# 29 ms ± 554 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit df.query("A == 'foo' & A != 'bar' | A != 'baz' & A != 'buz'")
# 16 ms ± 339 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit df[(df.B % 5) **2 < 0.1]
# 5.35 ms ± 37.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.query("(B % 5) **2 < 0.1")
# 4.37 ms ± 46.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2 Benchmark code using a frame with 800k rows
df = pd.DataFrame({'A': 'foo bar foo baz foo bar foo foo'.split()*100000,
'B': np.random.rand(800000)})
%timeit df[df.A == 'foo']
# 87.9 ms ± 873 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit df.query("A == 'foo'")
# 54.4 ms ± 726 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit df[((df.A == 'foo') & (df.A != 'bar')) | ((df.A != 'baz') & (df.A != 'buz'))]
# 310 ms ± 3.4 ms per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit df.query("A == 'foo' & A != 'bar' | A != 'baz' & A != 'buz'")
# 132 ms ± 2.43 ms per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit df[(df.B % 5) **2 < 0.1]
# 54 ms ± 488 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit df.query("(B % 5) **2 < 0.1")
# 26.3 ms ± 320 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
3: Code used to produce the performance graphs of the two methods for strings and numbers.
from perfplot import plot
constructor = lambda n: pd.DataFrame({'A': 'foo bar foo baz foo bar foo foo'.split()*n, 'B': np.random.rand(8*n)})
plot(
    setup=constructor,
    kernels=[lambda df: df[(df.B%5)**2<0.1], lambda df: df.query("(B%5)**2<0.1")],
    labels= ['df[(df.B % 5) **2 < 0.1]', 'df.query("(B % 5) **2 < 0.1")'],
    n_range=[2**k for k in range(4, 24)],
    xlabel='Rows in DataFrame',
    title='Multiple mathematical operations on numbers',
    equality_check=pd.DataFrame.equals)
plot(
    setup=constructor,
    kernels=[lambda df: df[df.A == 'foo'], lambda df: df.query("A == 'foo'")],
    labels= ["df[df.A == 'foo']", """df.query("A == 'foo'")"""],
    n_range=[2**k for k in range(4, 24)],
    xlabel='Rows in DataFrame',
    title='Comparison operation on strings',
    equality_check=pd.DataFrame.equals)
If you need to query your dataframe repeatedly and speed matters, you can convert the dataframe to a dictionary; lookups by key can then be much faster than filtering the dataframe each time.
my_df = df.set_index(column_name)
my_dict = my_df.to_dict('index')
After building the my_dict dictionary you can look values up:
if some_value in my_dict.keys():
    my_result = my_dict[some_value]
If column_name contains duplicated values you can't build the dictionary this way, but you can use:
my_result = my_df.loc[some_value]
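A small sketch of both lookups. The frames and the column names key and value are made up for illustration, and no timing claim is implied.

```python
import pandas as pd

# Unique keys: the frame can be turned into a dict of row dicts.
df = pd.DataFrame({'key': ['a', 'b', 'c'], 'value': [1, 2, 3]})
my_df = df.set_index('key')
my_dict = my_df.to_dict('index')

my_result = my_dict.get('b')   # the row as a plain dict
missing = my_dict.get('z')     # None instead of a KeyError

# Duplicated keys: fall back to label-based selection on the index.
dup_df = pd.DataFrame({'key': ['a', 'a', 'b'], 'value': [1, 2, 3]}).set_index('key')
rows_for_a = dup_df.loc['a']

print(my_result)
print(rows_for_a)
```

Using dict.get avoids the separate membership test when a missing key should simply yield None.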
SQL statements on DataFrames to select rows using DuckDB
With DuckDB we can query pandas DataFrames with SQL statements, in a highly performant way.
Since the question is How do I select rows from a DataFrame based on column values?, and the example in the question is a SQL query, this answer looks logical in this topic.
In [1]: import duckdb
In [2]: import pandas as pd
In [3]: con = duckdb.connect()
In [4]: df = pd.DataFrame({"A": range(11), "B": range(11, 22)})
In [5]: df
0 0 11
1 1 12
2 2 13
3 3 14
4 4 15
5 5 16
6 6 17
7 7 18
8 8 19
9 9 20
10 10 21
In [6]: results = con.execute("SELECT * FROM df where A > 2").df()
In [7]: results
0 3 14
1 4 15
2 5 16
3 6 17
4 7 18
5 8 19
6 9 20
7 10 21
You can use loc (square brackets) with a function:
# Series
s = pd.Series([1, 2, 3, 4])
s.loc[lambda x: x > 1]
# s[lambda x: x > 1]
1 2
2 3
3 4
dtype: int64
# DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})
df.loc[lambda x: x['A'] > 1]
# df[lambda x: x['A'] > 1]
1 2 20
2 3 30
Great answers. However, when the size of the dataframe approaches a million rows, many of the methods take ages when using df[df['col']==val]. I wanted all possible values of "another_column" that correspond to specific values in "some_column" (in this case in a dictionary). This worked, and it was fast.
for i, my_key in enumerate(df['some_column'].values):
    if i%100==0:
        print(i) # to see the progress
    if my_key not in my_dict.keys():
print('operation took '+str(e-s)+' seconds')

how can I transform the following data into pandas dataframe [duplicate]

I have data of this type, and I want each list of each id in a separate column:
id data
2 [1.81744912347, 1.96313966807, 1.79290908923]
3 [0.87738744314, 0.154642653196, 0.319845728764]
4 [1.12289279512, 1.16105905267, 1.14889626137]
5 [1.65093687407, 1.65010263863, 1.65614839538]
6 [0.103623262651, 0.46093367049, 0.549343505693]
7 [0.122299243819, 0.355964399805, 0.40010681636]
8 [3.08321032223, 2.92526466342, 2.6504125359, 2]
9 [0.287041436848, 0.264107869667, 0.29319302508]
10 [0.673829091668, 0.632715325748, 0.47099544284]
11 [3.04589375431, 2.19130582148, 1.68173686657]
How can I transform the data into a pandas DataFrame?
I want it to look like the following:
id data
1 1.61567967235
1 1.55256213176
1 1.16904355984
10 0.673829091668
10 0.632715325748
and so on
It's a large amount of data; if I use a loop to transform it, it kills the notebook. Is there any other way to process this data?
IIUC, from
0 [1, 2, 3]
1 [4, 5, 6]
you can do
df.col.apply(pd.Series).stack().reset_index(drop=True)
0 1
1 2
2 3
3 4
4 5
5 6
dtype: int64
pd.Series([z for x in df.col.values for z in x])
0 1
1 2
2 3
3 4
4 5
5 6
dtype: int64
%timeit df.col.apply(pd.Series).stack().reset_index(drop=True)
1.15 ms ± 26.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit pd.Series([z for x in df.col.values for z in x])
89.2 µs ± 2.58 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
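If the id should stay next to each value, as in the desired output of the question, DataFrame.explode (available since pandas 0.25) is another option. It is not one of the approaches timed above, and the sample values below are shortened stand-ins for the question's data.

```python
import pandas as pd

# Shortened, made-up stand-in for the id/data frame in the question.
df = pd.DataFrame({'id': [2, 3],
                   'data': [[1.8, 1.9, 1.7], [0.8, 0.1, 0.3]]})

# One row per list element; the id is repeated for each element.
long_df = df.explode('data').reset_index(drop=True)

# explode leaves an object column, so convert it back to floats.
long_df['data'] = long_df['data'].astype(float)

print(long_df)
```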

Applying RMS formula over three columns pandas

I am trying to apply an RMS function to accelerometer data, which is 3-dimensional. I also have a timestamp column at the beginning, which I have kept as a day count. The dataframe is as follows:
0 1 2 3
0 1.963 -12.0 -71.0 -2.0
1 1.963 -11.0 -71.0 -3.0
2 1.963 -14.0 -67.0 -6.0
3 1.963 -16.0 -63.0 -7.0
4 1.963 -18.0 -60.0 -8.0
Column '0' is Days, and the other columns are the 3-axis accelerometer data. Right now I am using this approach to compute the RMS value into a new column and drop the existing 3-axis data:
def rms_detrend(x):
return np.sqrt(np.mean(x[1]**2 + x[2]**2 + x[3]**2))
accdf =pd.read_csv(ACC_files[1],header=None)
accdf['ACC_RMS'] = accdf.apply(rms_detrend,axis=1)
accdf = accdf.drop([1,2,3],axis=1)
accdf.columns = ['Days', 'ACC_RMS']
However, I have 70 such accelerometer files, each with about 4000+ rows. Is there a better and quicker (Pythonic) way to do this? Thanks.
The code above handles just one file, and it is very slow.
A method from pandas
(accdf.iloc[:,1:]**2).sum(1).pow(1/2)
0 72.034714
1 71.909666
2 68.709534
3 65.375837
4 63.150614
dtype: float64
accdf['ACC_RMS'] = np.sqrt(accdf.pop(1)**2 + accdf.pop(2)**2 + accdf.pop(3)**2)
print (accdf)
0 1.963 72.034714
1 1.963 71.909666
2 1.963 68.709534
3 1.963 65.375837
4 1.963 63.150614
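Putting the pieces together for one file, here is a self-contained sketch on the five sample rows, built in memory rather than read from CSV since ACC_files is not shown. It computes the same quantity as the question's function: the square root of the sum of the three squared axes.

```python
import numpy as np
import pandas as pd

# The five sample rows from the question; columns are labelled 0..3.
accdf = pd.DataFrame([[1.963, -12.0, -71.0, -2.0],
                      [1.963, -11.0, -71.0, -3.0],
                      [1.963, -14.0, -67.0, -6.0],
                      [1.963, -16.0, -63.0, -7.0],
                      [1.963, -18.0, -60.0, -8.0]])

# Work on the raw array of the three axis columns, one vectorised pass.
axes = accdf[[1, 2, 3]].to_numpy()
accdf['ACC_RMS'] = np.sqrt((axes ** 2).sum(axis=1))

# Drop the axis columns and give the remaining two readable names.
accdf = accdf.drop(columns=[1, 2, 3])
accdf.columns = ['Days', 'ACC_RMS']

print(accdf)
```

Because no Python-level function runs per row, the same lines can be applied to each of the 70 files in a loop without the cost of apply(axis=1).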
NumPy solution to improve performance:
#[50000 rows x 4 columns]
accdf = pd.concat([accdf] * 10000, ignore_index=True)
In [27]: %timeit (accdf.iloc[:,1:]**2).sum(1).pow(1/2)
1.97 ms ± 89.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [28]: %timeit np.sqrt(np.sum(accdf.to_numpy()[:,1:]**2, axis=1))
202 µs ± 1.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Unfortunately my solution returned an error in testing, but I guess it is slower than the NumPy-only solution.
