Filter the DataFrame rows to only show movies with a 'duration' of at least 200 minutes [duplicate]

Filter the DataFrame rows to only show movies with a 'duration' of at least 200 minutes [duplicate] - python-3.x

How can I select rows from a DataFrame based on values in some column in Pandas?
In SQL, I would use:
SELECT *
FROM table
WHERE column_name = some_value

To select rows whose column value equals a scalar, some_value, use ==:
df.loc[df['column_name'] == some_value]
To select rows whose column value is in an iterable, some_values, use isin:
df.loc[df['column_name'].isin(some_values)]
Combine multiple conditions with &:
df.loc[(df['column_name'] >= A) & (df['column_name'] <= B)]
Note the parentheses. Due to Python's operator precedence rules, & binds more tightly than <= and >=. Thus, the parentheses in the last example are necessary. Without the parentheses
df['column_name'] >= A & df['column_name'] <= B
is parsed as
df['column_name'] >= (A & df['column_name']) <= B
which results in a Truth value of a Series is ambiguous error.
To select rows whose column value does not equal some_value, use !=:
df.loc[df['column_name'] != some_value]
isin returns a boolean Series, so to select rows whose value is not in some_values, negate the boolean Series using ~:
df.loc[~df['column_name'].isin(some_values)]
For example,
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
'B': 'one one two three two two one three'.split(),
'C': np.arange(8), 'D': np.arange(8) * 2})
print(df)
# A B C D
# 0 foo one 0 0
# 1 bar one 1 2
# 2 foo two 2 4
# 3 bar three 3 6
# 4 foo two 4 8
# 5 bar two 5 10
# 6 foo one 6 12
# 7 foo three 7 14
print(df.loc[df['A'] == 'foo'])
yields
A B C D
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
If you have multiple values you want to include, put them in a
list (or more generally, any iterable) and use isin:
print(df.loc[df['B'].isin(['one','three'])])
yields
A B C D
0 foo one 0 0
1 bar one 1 2
3 bar three 3 6
6 foo one 6 12
7 foo three 7 14
Note, however, that if you wish to do this many times, it is more efficient to
make an index first, and then use df.loc:
df = df.set_index(['B'])
print(df.loc['one'])
yields
A C D
B
one foo 0 0
one bar 1 2
one foo 6 12
or, to include multiple values from the index use df.index.isin:
df.loc[df.index.isin(['one','two'])]
yields
A C D
B
one foo 0 0
one bar 1 2
two foo 2 4
two foo 4 8
two bar 5 10
one foo 6 12

There are several ways to select rows from a Pandas dataframe:
Boolean indexing (df[df['col'] == value] )
Positional indexing (df.iloc[...])
Label indexing (df.xs(...))
df.query(...) API
Below I show you examples of each, with advice when to use certain techniques. Assume our criterion is column 'A' == 'foo'
(Note on performance: For each base type, we can keep things simple by using the Pandas API or we can venture outside the API, usually into NumPy, and speed things up.)
Setup
The first thing we'll need is to identify a condition that will act as our criterion for selecting rows. We'll start with the OP's case column_name == some_value, and include some other common use cases.
Borrowing from #unutbu:
import pandas as pd, numpy as np
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
'B': 'one one two three two two one three'.split(),
'C': np.arange(8), 'D': np.arange(8) * 2})
1. Boolean indexing
... Boolean indexing requires finding the true value of each row's 'A' column being equal to 'foo', then using those truth values to identify which rows to keep. Typically, we'd name this series, an array of truth values, mask. We'll do so here as well.
mask = df['A'] == 'foo'
We can then use this mask to slice or index the data frame
df[mask]
A B C D
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
This is one of the simplest ways to accomplish this task and if performance or intuitiveness isn't an issue, this should be your chosen method. However, if performance is a concern, then you might want to consider an alternative way of creating the mask.
2. Positional indexing
Positional indexing (df.iloc[...]) has its use cases, but this isn't one of them. In order to identify where to slice, we first need to perform the same boolean analysis we did above. This leaves us performing one extra step to accomplish the same task.
mask = df['A'] == 'foo'
pos = np.flatnonzero(mask)
df.iloc[pos]
A B C D
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
3. Label indexing
Label indexing can be very handy, but in this case, we are again doing more work for no benefit
df.set_index('A', append=True, drop=False).xs('foo', level=1)
A B C D
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
4. df.query() API
pd.DataFrame.query is a very elegant/intuitive way to perform this task, but is often slower. However, if you pay attention to the timings below, for large data, the query is very efficient. More so than the standard approach and of similar magnitude as my best suggestion.
df.query('A == "foo"')
A B C D
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
My preference is to use the Boolean mask
Actual improvements can be made by modifying how we create our Boolean mask.
mask alternative 1
Use the underlying NumPy array and forgo the overhead of creating another pd.Series
mask = df['A'].values == 'foo'
I'll show more complete time tests at the end, but just take a look at the performance gains we get using the sample data frame. First, we look at the difference in creating the mask
%timeit mask = df['A'].values == 'foo'
%timeit mask = df['A'] == 'foo'
5.84 µs ± 195 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
166 µs ± 4.45 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Evaluating the mask with the NumPy array is ~ 30 times faster. This is partly due to NumPy evaluation often being faster. It is also partly due to the lack of overhead necessary to build an index and a corresponding pd.Series object.
Next, we'll look at the timing for slicing with one mask versus the other.
mask = df['A'].values == 'foo'
%timeit df[mask]
mask = df['A'] == 'foo'
%timeit df[mask]
219 µs ± 12.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
239 µs ± 7.03 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The performance gains aren't as pronounced. We'll see if this holds up over more robust testing.
mask alternative 2
We could have reconstructed the data frame as well. There is a big caveat when reconstructing a dataframe—you must take care of the dtypes when doing so!
Instead of df[mask] we will do this
pd.DataFrame(df.values[mask], df.index[mask], df.columns).astype(df.dtypes)
If the data frame is of mixed type, which our example is, then when we get df.values the resulting array is of dtype object and consequently, all columns of the new data frame will be of dtype object. Thus requiring the astype(df.dtypes) and killing any potential performance gains.
%timeit df[m]
%timeit pd.DataFrame(df.values[mask], df.index[mask], df.columns).astype(df.dtypes)
216 µs ± 10.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.43 ms ± 39.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
However, if the data frame is not of mixed type, this is a very useful way to do it.
Given
np.random.seed([3,1415])
d1 = pd.DataFrame(np.random.randint(10, size=(10, 5)), columns=list('ABCDE'))
d1
A B C D E
0 0 2 7 3 8
1 7 0 6 8 6
2 0 2 0 4 9
3 7 3 2 4 3
4 3 6 7 7 4
5 5 3 7 5 9
6 8 7 6 4 7
7 6 2 6 6 5
8 2 8 7 5 8
9 4 7 6 1 5
%%timeit
mask = d1['A'].values == 7
d1[mask]
179 µs ± 8.73 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Versus
%%timeit
mask = d1['A'].values == 7
pd.DataFrame(d1.values[mask], d1.index[mask], d1.columns)
87 µs ± 5.12 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
We cut the time in half.
mask alternative 3
#unutbu also shows us how to use pd.Series.isin to account for each element of df['A'] being in a set of values. This evaluates to the same thing if our set of values is a set of one value, namely 'foo'. But it also generalizes to include larger sets of values if needed. Turns out, this is still pretty fast even though it is a more general solution. The only real loss is in intuitiveness for those not familiar with the concept.
mask = df['A'].isin(['foo'])
df[mask]
A B C D
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
However, as before, we can utilize NumPy to improve performance while sacrificing virtually nothing. We'll use np.in1d
mask = np.in1d(df['A'].values, ['foo'])
df[mask]
A B C D
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
Timing
I'll include other concepts mentioned in other posts as well for reference.
Code Below
Each column in this table represents a different length data frame over which we test each function. Each column shows relative time taken, with the fastest function given a base index of 1.0.
res.div(res.min())
10 30 100 300 1000 3000 10000 30000
mask_standard 2.156872 1.850663 2.034149 2.166312 2.164541 3.090372 2.981326 3.131151
mask_standard_loc 1.879035 1.782366 1.988823 2.338112 2.361391 3.036131 2.998112 2.990103
mask_with_values 1.010166 1.000000 1.005113 1.026363 1.028698 1.293741 1.007824 1.016919
mask_with_values_loc 1.196843 1.300228 1.000000 1.000000 1.038989 1.219233 1.037020 1.000000
query 4.997304 4.765554 5.934096 4.500559 2.997924 2.397013 1.680447 1.398190
xs_label 4.124597 4.272363 5.596152 4.295331 4.676591 5.710680 6.032809 8.950255
mask_with_isin 1.674055 1.679935 1.847972 1.724183 1.345111 1.405231 1.253554 1.264760
mask_with_in1d 1.000000 1.083807 1.220493 1.101929 1.000000 1.000000 1.000000 1.144175
You'll notice that the fastest times seem to be shared between mask_with_values and mask_with_in1d.
res.T.plot(loglog=True)
Functions
def mask_standard(df):
mask = df['A'] == 'foo'
return df[mask]
def mask_standard_loc(df):
mask = df['A'] == 'foo'
return df.loc[mask]
def mask_with_values(df):
mask = df['A'].values == 'foo'
return df[mask]
def mask_with_values_loc(df):
mask = df['A'].values == 'foo'
return df.loc[mask]
def query(df):
return df.query('A == "foo"')
def xs_label(df):
return df.set_index('A', append=True, drop=False).xs('foo', level=-1)
def mask_with_isin(df):
mask = df['A'].isin(['foo'])
return df[mask]
def mask_with_in1d(df):
mask = np.in1d(df['A'].values, ['foo'])
return df[mask]
Testing
res = pd.DataFrame(
index=[
'mask_standard', 'mask_standard_loc', 'mask_with_values', 'mask_with_values_loc',
'query', 'xs_label', 'mask_with_isin', 'mask_with_in1d'
],
columns=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
dtype=float
)
for j in res.columns:
d = pd.concat([df] * j, ignore_index=True)
for i in res.index:a
stmt = '{}(d)'.format(i)
setp = 'from __main__ import d, {}'.format(i)
res.at[i, j] = timeit(stmt, setp, number=50)
Special Timing
Looking at the special case when we have a single non-object dtype for the entire data frame.
Code Below
spec.div(spec.min())
10 30 100 300 1000 3000 10000 30000
mask_with_values 1.009030 1.000000 1.194276 1.000000 1.236892 1.095343 1.000000 1.000000
mask_with_in1d 1.104638 1.094524 1.156930 1.072094 1.000000 1.000000 1.040043 1.027100
reconstruct 1.000000 1.142838 1.000000 1.355440 1.650270 2.222181 2.294913 3.406735
Turns out, reconstruction isn't worth it past a few hundred rows.
spec.T.plot(loglog=True)
Functions
np.random.seed([3,1415])
d1 = pd.DataFrame(np.random.randint(10, size=(10, 5)), columns=list('ABCDE'))
def mask_with_values(df):
mask = df['A'].values == 'foo'
return df[mask]
def mask_with_in1d(df):
mask = np.in1d(df['A'].values, ['foo'])
return df[mask]
def reconstruct(df):
v = df.values
mask = np.in1d(df['A'].values, ['foo'])
return pd.DataFrame(v[mask], df.index[mask], df.columns)
spec = pd.DataFrame(
index=['mask_with_values', 'mask_with_in1d', 'reconstruct'],
columns=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
dtype=float
)
Testing
for j in spec.columns:
d = pd.concat([df] * j, ignore_index=True)
for i in spec.index:
stmt = '{}(d)'.format(i)
setp = 'from __main__ import d, {}'.format(i)
spec.at[i, j] = timeit(stmt, setp, number=50)

tl;dr
The Pandas equivalent to
select * from table where column_name = some_value
is
table[table.column_name == some_value]
Multiple conditions:
table[(table.column_name == some_value) | (table.column_name2 == some_value2)]
or
table.query('column_name == some_value | column_name2 == some_value2')
Code example
import pandas as pd
# Create data set
d = {'foo':[100, 111, 222],
'bar':[333, 444, 555]}
df = pd.DataFrame(d)
# Full dataframe:
df
# Shows:
# bar foo
# 0 333 100
# 1 444 111
# 2 555 222
# Output only the row(s) in df where foo is 222:
df[df.foo == 222]
# Shows:
# bar foo
# 2 555 222
In the above code it is the line df[df.foo == 222] that gives the rows based on the column value, 222 in this case.
Multiple conditions are also possible:
df[(df.foo == 222) | (df.bar == 444)]
# bar foo
# 1 444 111
# 2 555 222
But at that point I would recommend using the query function, since it's less verbose and yields the same result:
df.query('foo == 222 | bar == 444')

I find the syntax of the previous answers to be redundant and difficult to remember. Pandas introduced the query() method in v0.13 and I much prefer it. For your question, you could do df.query('col == val').
Reproduced from The query() Method (Experimental):
In [167]: n = 10
In [168]: df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))
In [169]: df
Out[169]:
a b c
0 0.687704 0.582314 0.281645
1 0.250846 0.610021 0.420121
2 0.624328 0.401816 0.932146
3 0.011763 0.022921 0.244186
4 0.590198 0.325680 0.890392
5 0.598892 0.296424 0.007312
6 0.634625 0.803069 0.123872
7 0.924168 0.325076 0.303746
8 0.116822 0.364564 0.454607
9 0.986142 0.751953 0.561512
# pure python
In [170]: df[(df.a < df.b) & (df.b < df.c)]
Out[170]:
a b c
3 0.011763 0.022921 0.244186
8 0.116822 0.364564 0.454607
# query
In [171]: df.query('(a < b) & (b < c)')
Out[171]:
a b c
3 0.011763 0.022921 0.244186
8 0.116822 0.364564 0.454607
You can also access variables in the environment by prepending an #.
exclude = ('red', 'orange')
df.query('color not in #exclude')

More flexibility using .query with pandas >= 0.25.0:
Since pandas >= 0.25.0 we can use the query method to filter dataframes with pandas methods and even column names which have spaces. Normally the spaces in column names would give an error, but now we can solve that using a backtick (`) - see GitHub:
# Example dataframe
df = pd.DataFrame({'Sender email':['ex#example.com', "reply#shop.com", "buy#shop.com"]})
Sender email
0 ex#example.com
1 reply#shop.com
2 buy#shop.com
Using .query with method str.endswith:
df.query('`Sender email`.str.endswith("#shop.com")')
Output
Sender email
1 reply#shop.com
2 buy#shop.com
Also we can use local variables by prefixing it with an # in our query:
domain = 'shop.com'
df.query('`Sender email`.str.endswith(#domain)')
Output
Sender email
1 reply#shop.com
2 buy#shop.com

For selecting only specific columns out of multiple columns for a given value in Pandas:
select col_name1, col_name2 from table where column_name = some_value.
Options loc:
df.loc[df['column_name'] == some_value, [col_name1, col_name2]]
or query:
df.query('column_name == some_value')[[col_name1, col_name2]]

In newer versions of Pandas, inspired by the documentation (Viewing data):
df[df["colume_name"] == some_value] #Scalar, True/False..
df[df["colume_name"] == "some_value"] #String
Combine multiple conditions by putting the clause in parentheses, (), and combining them with & and | (and/or). Like this:
df[(df["colume_name"] == "some_value1") & (pd[pd["colume_name"] == "some_value2"])]
Other filters
pandas.notna(df["colume_name"]) == True # Not NaN
df['colume_name'].str.contains("text") # Search for "text"
df['colume_name'].str.lower().str.contains("text") # Search for "text", after converting to lowercase

Faster results can be achieved using numpy.where.
For example, with unubtu's setup -
In [76]: df.iloc[np.where(df.A.values=='foo')]
Out[76]:
A B C D
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
Timing comparisons:
In [68]: %timeit df.iloc[np.where(df.A.values=='foo')] # fastest
1000 loops, best of 3: 380 µs per loop
In [69]: %timeit df.loc[df['A'] == 'foo']
1000 loops, best of 3: 745 µs per loop
In [71]: %timeit df.loc[df['A'].isin(['foo'])]
1000 loops, best of 3: 562 µs per loop
In [72]: %timeit df[df.A=='foo']
1000 loops, best of 3: 796 µs per loop
In [74]: %timeit df.query('(A=="foo")') # slowest
1000 loops, best of 3: 1.71 ms per loop

Here is a simple example
from pandas import DataFrame
# Create data set
d = {'Revenue':[100,111,222],
'Cost':[333,444,555]}
df = DataFrame(d)
# mask = Return True when the value in column "Revenue" is equal to 111
mask = df['Revenue'] == 111
print mask
# Result:
# 0 False
# 1 True
# 2 False
# Name: Revenue, dtype: bool
# Select * FROM df WHERE Revenue = 111
df[mask]
# Result:
# Cost Revenue
# 1 444 111

To add: You can also do df.groupby('column_name').get_group('column_desired_value').reset_index() to make a new data frame with specified column having a particular value. E.g.,
import pandas as pd
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
'B': 'one one two three two two one three'.split()})
print("Original dataframe:")
print(df)
b_is_two_dataframe = pd.DataFrame(df.groupby('B').get_group('two').reset_index()).drop('index', axis = 1)
#NOTE: the final drop is to remove the extra index column returned by groupby object
print('Sub dataframe where B is two:')
print(b_is_two_dataframe)
Running this gives:
Original dataframe:
A B
0 foo one
1 bar one
2 foo two
3 bar three
4 foo two
5 bar two
6 foo one
7 foo three
Sub dataframe where B is two:
A B
0 foo two
1 foo two
2 bar two

You can also use .apply:
df.apply(lambda row: row[df['B'].isin(['one','three'])])
It actually works row-wise (i.e., applies the function to each row).
The output is
A B C D
0 foo one 0 0
1 bar one 1 2
3 bar three 3 6
6 foo one 6 12
7 foo three 7 14
The results is the same as using as mentioned by #unutbu
df[[df['B'].isin(['one','three'])]]

1. Use f-strings inside query() calls
If the column name used to filter your dataframe comes from a local variable, f-strings may be useful. For example,
col = 'A'
df.query(f"{col} == 'foo'")
In fact, f-strings can be used for the query variable as well (except for datetime):
col = 'A'
my_var = 'foo'
df.query(f"{col} == '{my_var}'") # if my_var is a string
my_num = 1
df.query(f"{col} == {my_num}") # if my_var is a number
my_date = '2022-12-10'
df.query(f"{col} == #my_date") # must use # for datetime though
2. Install numexpr to speed up query() calls
The pandas documentation recommends installing numexpr to speed up numeric calculation when using query(). Use pip install numexpr (or conda, sudo etc. depending on your environment) to install it.
For larger dataframes (where performance actually matters), df.query() with numexpr engine performs much faster than df[mask]. In particular, it performs better for the following cases.
Logical and/or comparison operators on columns of strings
If a column of strings are compared to some other string(s) and matching rows are to be selected, even for a single comparison operation, query() performs faster than df[mask]. For example, for a dataframe with 80k rows, it's 30% faster1 and for a dataframe with 800k rows, it's 60% faster.2
df[df.A == 'foo']
df.query("A == 'foo'") # <--- performs 30%-60% faster
This gap increases as the number of operations increases (if 4 comparisons are chained df.query() is 2-2.3 times faster than df[mask])1,2 and/or the dataframe length increases.2
Multiple operations on numeric columns
If multiple arithmetic, logical or comparison operations need to be computed to create a boolean mask to filter df, query() performs faster. For example, for a frame with 80k rows, it's 20% faster1 and for a frame with 800k rows, it's 2 times faster.2
df[(df.B % 5) **2 < 0.1]
df.query("(B % 5) **2 < 0.1") # <--- performs 20%-100% faster.
This gap in performance increases as the number of operations increases and/or the dataframe length increases.2
The following plot shows how the methods perform as the dataframe length increases.3
3. Call pandas methods inside query()
Numexpr currently supports only logical (&, |, ~), comparison (==, >, <, >=, <=, !=) and basic arithmetic operators (+, -, *, /, **, %).
For example, it doesn't support integer division (//). However, calling the equivalent pandas method (floordiv()) works.
df.query('B.floordiv(2) <= 3') # or
df.query('B.floordiv(2).le(3)')
# for pandas < 1.4, need `.values`
df.query('B.floordiv(2).values <= 3')
1 Benchmark code using a frame with 80k rows
import numpy as np
df = pd.DataFrame({'A': 'foo bar foo baz foo bar foo foo'.split()*10000,
'B': np.random.rand(80000)})
%timeit df[df.A == 'foo']
# 8.5 ms ± 104.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.query("A == 'foo'")
# 6.36 ms ± 95.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df[((df.A == 'foo') & (df.A != 'bar')) | ((df.A != 'baz') & (df.A != 'buz'))]
# 29 ms ± 554 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit df.query("A == 'foo' & A != 'bar' | A != 'baz' & A != 'buz'")
# 16 ms ± 339 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit df[(df.B % 5) **2 < 0.1]
# 5.35 ms ± 37.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.query("(B % 5) **2 < 0.1")
# 4.37 ms ± 46.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2 Benchmark code using a frame with 800k rows
df = pd.DataFrame({'A': 'foo bar foo baz foo bar foo foo'.split()*100000,
'B': np.random.rand(800000)})
%timeit df[df.A == 'foo']
# 87.9 ms ± 873 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit df.query("A == 'foo'")
# 54.4 ms ± 726 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit df[((df.A == 'foo') & (df.A != 'bar')) | ((df.A != 'baz') & (df.A != 'buz'))]
# 310 ms ± 3.4 ms per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit df.query("A == 'foo' & A != 'bar' | A != 'baz' & A != 'buz'")
# 132 ms ± 2.43 ms per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit df[(df.B % 5) **2 < 0.1]
# 54 ms ± 488 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit df.query("(B % 5) **2 < 0.1")
# 26.3 ms ± 320 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
3: Code used to produce the performance graphs of the two methods for strings and numbers.
from perfplot import plot
constructor = lambda n: pd.DataFrame({'A': 'foo bar foo baz foo bar foo foo'.split()*n, 'B': np.random.rand(8*n)})
plot(
setup=constructor,
kernels=[lambda df: df[(df.B%5)**2<0.1], lambda df: df.query("(B%5)**2<0.1")],
labels= ['df[(df.B % 5) **2 < 0.1]', 'df.query("(B % 5) **2 < 0.1")'],
n_range=[2**k for k in range(4, 24)],
xlabel='Rows in DataFrame',
title='Multiple mathematical operations on numbers',
equality_check=pd.DataFrame.equals);
plot(
setup=constructor,
kernels=[lambda df: df[df.A == 'foo'], lambda df: df.query("A == 'foo'")],
labels= ["df[df.A == 'foo']", """df.query("A == 'foo'")"""],
n_range=[2**k for k in range(4, 24)],
xlabel='Rows in DataFrame',
title='Comparison operation on strings',
equality_check=pd.DataFrame.equals);

If you want to make query to your dataframe repeatedly and speed is important to you, the best thing is to convert your dataframe to dictionary and then by doing this you can make query thousands of times faster.
my_df = df.set_index(column_name)
my_dict = my_df.to_dict('index')
After make my_dict dictionary you can go through:
if some_value in my_dict.keys():
my_result = my_dict[some_value]
If you have duplicated values in column_name you can't make a dictionary. but you can use:
my_result = my_df.loc[some_value]

SQL statements on DataFrames to select rows using DuckDB
With DuckDB we can query pandas DataFrames with SQL statements, in a highly performant way.
Since the question is How do I select rows from a DataFrame based on column values?, and the example in the question is a SQL query, this answer looks logical in this topic.
Example:
In [1]: import duckdb
In [2]: import pandas as pd
In [3]: con = duckdb.connect()
In [4]: df = pd.DataFrame({"A": range(11), "B": range(11, 22)})
In [5]: df
Out[5]:
A B
0 0 11
1 1 12
2 2 13
3 3 14
4 4 15
5 5 16
6 6 17
7 7 18
8 8 19
9 9 20
10 10 21
In [6]: results = con.execute("SELECT * FROM df where A > 2").df()
In [7]: results
Out[7]:
A B
0 3 14
1 4 15
2 5 16
3 6 17
4 7 18
5 8 19
6 9 20
7 10 21

You can use loc (square brackets) with a function:
# Series
s = pd.Series([1, 2, 3, 4])
s.loc[lambda x: x > 1]
# s[lambda x: x > 1]
Output:
1 2
2 3
3 4
dtype: int64
or
# DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})
df.loc[lambda x: x['A'] > 1]
# df[lambda x: x['A'] > 1]
Output:
A B
1 2 20
2 3 30

Great answers. Only, when the size of the dataframe approaches million rows, many of the methods tend to take ages when using df[df['col']==val]. I wanted to have all possible values of "another_column" that correspond to specific values in "some_column" (in this case in a dictionary). This worked and fast.
s=datetime.datetime.now()
my_dict={}
for i, my_key in enumerate(df['some_column'].values):
if i%100==0:
print(i) # to see the progress
if my_key not in my_dict.keys():
my_dict[my_key]={}
my_dict[my_key]['values']=[df.iloc[i]['another_column']]
else:
my_dict[my_key]['values'].append(df.iloc[i]['another_column'])
e=datetime.datetime.now()
print('operation took '+str(e-s)+' seconds')```

Related

Sub-Totals & Total In Pandas [duplicate]

This is obviously simple, but as a numpy newbe I'm getting stuck.
I have a CSV file that contains 3 columns, the State, the Office ID, and the Sales for that office.
I want to calculate the percentage of sales per office in a given state (total of all percentages in each state is 100%).
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
'office_id': list(range(1, 7)) * 2,
'sales': [np.random.randint(100000, 999999)
for _ in range(12)]})
df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
This returns:
sales
state office_id
AZ 2 839507
4 373917
6 347225
CA 1 798585
3 890850
5 454423
CO 1 819975
3 202969
5 614011
WA 2 163942
4 369858
6 959285
I can't seem to figure out how to "reach up" to the state level of the groupby to total up the sales for the entire state to calculate the fraction.

Update 2022-03
This answer by caner using transform looks much better than my original answer!
df['sales'] / df.groupby('state')['sales'].transform('sum')
Thanks to this comment by Paul Rougieux for surfacing it.
Original Answer (2014)
Paul H's answer is right that you will have to make a second groupby object, but you can calculate the percentage in a simpler way -- just groupby the state_office and divide the sales column by its sum. Copying the beginning of Paul H's answer:
# From Paul H
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
'office_id': list(range(1, 7)) * 2,
'sales': [np.random.randint(100000, 999999)
for _ in range(12)]})
state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
# Change: groupby state_office and divide by sum
state_pcts = state_office.groupby(level=0).apply(lambda x:
100 * x / float(x.sum()))
Returns:
sales
state office_id
AZ 2 16.981365
4 19.250033
6 63.768601
CA 1 19.331879
3 33.858747
5 46.809373
CO 1 36.851857
3 19.874290
5 43.273852
WA 2 34.707233
4 35.511259
6 29.781508

(This solution is inspired from this article https://pbpython.com/pandas_transform.html)
I find the following solution to be the simplest(and probably the fastest) using transformation:
Transformation: While aggregation must return a reduced version of the
data, transformation can return some transformed version of the full
data to recombine. For such a transformation, the output is the same
shape as the input.
So using transformation, the solution is 1-liner:
df['%'] = 100 * df['sales'] / df.groupby('state')['sales'].transform('sum')
And if you print:
print(df.sort_values(['state', 'office_id']).reset_index(drop=True))
state office_id sales %
0 AZ 2 195197 9.844309
1 AZ 4 877890 44.274352
2 AZ 6 909754 45.881339
3 CA 1 614752 50.415708
4 CA 3 395340 32.421767
5 CA 5 209274 17.162525
6 CO 1 549430 42.659629
7 CO 3 457514 35.522956
8 CO 5 280995 21.817415
9 WA 2 828238 35.696929
10 WA 4 719366 31.004563
11 WA 6 772590 33.298509

You need to make a second groupby object that groups by the states, and then use the div method:
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
'office_id': list(range(1, 7)) * 2,
'sales': [np.random.randint(100000, 999999) for _ in range(12)]})
state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
state = df.groupby(['state']).agg({'sales': 'sum'})
state_office.div(state, level='state') * 100
sales
state office_id
AZ 2 16.981365
4 19.250033
6 63.768601
CA 1 19.331879
3 33.858747
5 46.809373
CO 1 36.851857
3 19.874290
5 43.273852
WA 2 34.707233
4 35.511259
6 29.781508
the level='state' kwarg in div tells pandas to broadcast/join the dataframes base on the values in the state level of the index.

For conciseness I'd use the SeriesGroupBy:
In [11]: c = df.groupby(['state', 'office_id'])['sales'].sum().rename("count")
In [12]: c
Out[12]:
state office_id
AZ 2 925105
4 592852
6 362198
CA 1 819164
3 743055
5 292885
CO 1 525994
3 338378
5 490335
WA 2 623380
4 441560
6 451428
Name: count, dtype: int64
In [13]: c / c.groupby(level=0).sum()
Out[13]:
state office_id
AZ 2 0.492037
4 0.315321
6 0.192643
CA 1 0.441573
3 0.400546
5 0.157881
CO 1 0.388271
3 0.249779
5 0.361949
WA 2 0.411101
4 0.291196
6 0.297703
Name: count, dtype: float64
For multiple groups you have to use transform (using Radical's df):
In [21]: c = df.groupby(["Group 1","Group 2","Final Group"])["Numbers I want as percents"].sum().rename("count")
In [22]: c / c.groupby(level=[0, 1]).transform("sum")
Out[22]:
Group 1 Group 2 Final Group
AAHQ BOSC OWON 0.331006
TLAM 0.668994
MQVF BWSI 0.288961
FXZM 0.711039
ODWV NFCH 0.262395
...
Name: count, dtype: float64
This seems to be slightly more performant than the other answers (just less than twice the speed of Radical's answer, for me ~0.08s).

I think this needs benchmarking. Using OP's original DataFrame,
df = pd.DataFrame({
'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
'office_id': list(range(1, 7)) * 2,
'sales': [np.random.randint(100000, 999999) for _ in range(12)]
})
0th Caner
NEW Pandas Tranform looks much faster.
df['sales'] / df.groupby('state')['sales'].transform('sum')
1.32 ms ± 352 µs per loop
(mean ± std. dev. of 7 runs, 100 loops each)
1st Andy Hayden
As commented on his answer, Andy takes full advantage of vectorisation and pandas indexing.
c = df.groupby(['state', 'office_id'])['sales'].sum().rename("count")
c / c.groupby(level=0).sum()
3.42 ms ± 16.7 µs per loop
(mean ± std. dev. of 7 runs, 100 loops each)
2nd Paul H
state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
state = df.groupby(['state']).agg({'sales': 'sum'})
state_office.div(state, level='state') * 100
4.66 ms ± 24.4 µs per loop
(mean ± std. dev. of 7 runs, 100 loops each)
3rd exp1orer
This is the slowest answer as it calculates x.sum() for each x in level 0.
For me, this is still a useful answer, though not in its current form. For quick EDA on smaller datasets, apply allows you use method chaining to write this in a single line. We therefore remove the need decide on a variable's name, which is actually very computationally expensive for your most valuable resource (your brain!!).
Here is the modification,
(
df.groupby(['state', 'office_id'])
.agg({'sales': 'sum'})
.groupby(level=0)
.apply(lambda x: 100 * x / float(x.sum()))
)
10.6 ms ± 81.5 µs per loop
(mean ± std. dev. of 7 runs, 100 loops each)
So no one is going care about 6ms on a small dataset. However, this is 3x speed up and, on a larger dataset with high cardinality groupbys this is going to make a massive difference.
Adding to the above code, we make a DataFrame with shape (12,000,000, 3) with 14412 state categories and 600 office_ids,
import string
import numpy as np
import pandas as pd
np.random.seed(0)
groups = [
''.join(i) for i in zip(
np.random.choice(np.array([i for i in string.ascii_lowercase]), 30000),
np.random.choice(np.array([i for i in string.ascii_lowercase]), 30000),
np.random.choice(np.array([i for i in string.ascii_lowercase]), 30000),
)
]
df = pd.DataFrame({'state': groups * 400,
'office_id': list(range(1, 601)) * 20000,
'sales': [np.random.randint(100000, 999999)
for _ in range(12)] * 1000000
})
Using Caner's,
0.791 s ± 19.4 ms per loop
(mean ± std. dev. of 7 runs, 1 loop each)
Using Andy's,
2 s ± 10.4 ms per loop
(mean ± std. dev. of 7 runs, 1 loop each)
and exp1orer
19 s ± 77.1 ms per loop
(mean ± std. dev. of 7 runs, 1 loop each)
So now we see x10 speed up on large, high cardinality datasets with Andy but a very impressive x20 speed up with Caner's.
Be sure to UV these three answers if you UV this one!!
Edit: added Caner benchmark

I realize there are already good answers here.
I nevertheless would like to contribute my own, because I feel for an elementary, simple question like this, there should be a short solution that is understandable at a glance.
It should also work in a way that I can add the percentages as a new column, leaving the rest of the dataframe untouched. Last but not least, it should generalize in an obvious way to the case in which there is more than one grouping level (e.g., state and country instead of only state).
The following snippet fulfills these criteria:
df['sales_ratio'] = df.groupby(['state'])['sales'].transform(lambda x: x/x.sum())
Note that if you're still using Python 2, you'll have to replace the x in the denominator of the lambda term by float(x).

I know that this is an old question, but exp1orer's answer is very slow for datasets with a large number unique groups (probably because of the lambda). I built off of their answer to turn it into an array calculation so now it's super fast! Below is the example code:
Create the test dataframe with 50,000 unique groups
import random
import string
import pandas as pd
import numpy as np
np.random.seed(0)
# This is the total number of groups to be created
NumberOfGroups = 50000
# Create a lot of groups (random strings of 4 letters)
Group1 = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups/10)]*10
Group2 = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups/2)]*2
FinalGroup = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups)]
# Make the numbers
NumbersForPercents = [np.random.randint(100, 999) for _ in range(NumberOfGroups)]
# Make the dataframe
df = pd.DataFrame({'Group 1': Group1,
'Group 2': Group2,
'Final Group': FinalGroup,
'Numbers I want as percents': NumbersForPercents})
When grouped it looks like:
Numbers I want as percents
Group 1 Group 2 Final Group
AAAH AQYR RMCH 847
XDCL 182
DQGO ALVF 132
AVPH 894
OVGH NVOO 650
VKQP 857
VNLY HYFW 884
MOYH 469
XOOC GIDS 168
HTOY 544
AACE HNXU RAXK 243
YZNK 750
NOYI NYGC 399
ZYCI 614
QKGK CRLF 520
UXNA 970
TXAR MLNB 356
NMFJ 904
VQYG NPON 504
QPKQ 948
...
[50000 rows x 1 columns]
Array method of finding percentage:
# Initial grouping (basically a sorted version of df)
PreGroupby_df = df.groupby(["Group 1","Group 2","Final Group"]).agg({'Numbers I want as percents': 'sum'}).reset_index()
# Get the sum of values for the "final group", append "_Sum" to it's column name, and change it into a dataframe (.reset_index)
SumGroup_df = df.groupby(["Group 1","Group 2"]).agg({'Numbers I want as percents': 'sum'}).add_suffix('_Sum').reset_index()
# Merge the two dataframes
Percents_df = pd.merge(PreGroupby_df, SumGroup_df)
# Divide the two columns
Percents_df["Percent of Final Group"] = Percents_df["Numbers I want as percents"] / Percents_df["Numbers I want as percents_Sum"] * 100
# Drop the extra _Sum column
Percents_df.drop(["Numbers I want as percents_Sum"], inplace=True, axis=1)
This method takes about ~0.15 seconds
Top answer method (using lambda function):
state_office = df.groupby(['Group 1','Group 2','Final Group']).agg({'Numbers I want as percents': 'sum'})
state_pcts = state_office.groupby(level=['Group 1','Group 2']).apply(lambda x: 100 * x / float(x.sum()))
This method takes about ~21 seconds to produce the same result.
The result:
Group 1 Group 2 Final Group Numbers I want as percents Percent of Final Group
0 AAAH AQYR RMCH 847 82.312925
1 AAAH AQYR XDCL 182 17.687075
2 AAAH DQGO ALVF 132 12.865497
3 AAAH DQGO AVPH 894 87.134503
4 AAAH OVGH NVOO 650 43.132050
5 AAAH OVGH VKQP 857 56.867950
6 AAAH VNLY HYFW 884 65.336290
7 AAAH VNLY MOYH 469 34.663710
8 AAAH XOOC GIDS 168 23.595506
9 AAAH XOOC HTOY 544 76.404494

The most elegant way to find percentages across columns or index is to use pd.crosstab.
Sample Data
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
'office_id': list(range(1, 7)) * 2,
'sales': [np.random.randint(100000, 999999) for _ in range(12)]})
The output dataframe is like this
print(df)
state office_id sales
0 CA 1 764505
1 WA 2 313980
2 CO 3 558645
3 AZ 4 883433
4 CA 5 301244
5 WA 6 752009
6 CO 1 457208
7 AZ 2 259657
8 CA 3 584471
9 WA 4 122358
10 CO 5 721845
11 AZ 6 136928
Just specify the index, columns and the values to aggregate. The normalize keyword will calculate % across index or columns depending upon the context.
result = pd.crosstab(index=df['state'],
columns=df['office_id'],
values=df['sales'],
aggfunc='sum',
normalize='index').applymap('{:.2f}%'.format)
print(result)
office_id 1 2 3 4 5 6
state
AZ 0.00% 0.20% 0.00% 0.69% 0.00% 0.11%
CA 0.46% 0.00% 0.35% 0.00% 0.18% 0.00%
CO 0.26% 0.00% 0.32% 0.00% 0.42% 0.00%
WA 0.00% 0.26% 0.00% 0.10% 0.00% 0.63%

You can sum the whole DataFrame and divide by the state total:
# Copying setup from Paul H answer
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
'office_id': list(range(1, 7)) * 2,
'sales': [np.random.randint(100000, 999999) for _ in range(12)]})
# Add a column with the sales divided by state total sales.
df['sales_ratio'] = (df / df.groupby(['state']).transform(sum))['sales']
df
Returns
office_id sales state sales_ratio
0 1 405711 CA 0.193319
1 2 535829 WA 0.347072
2 3 217952 CO 0.198743
3 4 252315 AZ 0.192500
4 5 982371 CA 0.468094
5 6 459783 WA 0.297815
6 1 404137 CO 0.368519
7 2 222579 AZ 0.169814
8 3 710581 CA 0.338587
9 4 548242 WA 0.355113
10 5 474564 CO 0.432739
11 6 835831 AZ 0.637686
But note that this only works because all columns other than state are numeric, enabling summation of the entire DataFrame. For example, if office_id is character instead, you get an error:
df.office_id = df.office_id.astype(str)
df['sales_ratio'] = (df / df.groupby(['state']).transform(sum))['sales']
TypeError: unsupported operand type(s) for /: 'str' and 'str'

import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
'office_id': list(range(1, 7)) * 2,
'sales': [np.random.randint(100000, 999999) for _ in range(12)]})
df.groupby(['state', 'office_id'])['sales'].sum().rename("weightage").groupby(level = 0).transform(lambda x: x/x.sum())
df.reset_index()
Output:
state office_id weightage
0 AZ 2 0.169814
1 AZ 4 0.192500
2 AZ 6 0.637686
3 CA 1 0.193319
4 CA 3 0.338587
5 CA 5 0.468094
6 CO 1 0.368519
7 CO 3 0.198743
8 CO 5 0.432739
9 WA 2 0.347072
10 WA 4 0.355113
11 WA 6 0.297815

I think this would do the trick in 1 line:
df.groupby(['state', 'office_id']).sum().transform(lambda x: x/np.sum(x)*100)

Simple way I have used is a merge after the 2 groupby's then doing simple division.
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
'office_id': list(range(1, 7)) * 2,
'sales': [np.random.randint(100000, 999999) for _ in range(12)]})
state_office = df.groupby(['state', 'office_id'])['sales'].sum().reset_index()
state = df.groupby(['state'])['sales'].sum().reset_index()
state_office = state_office.merge(state, left_on='state', right_on ='state', how = 'left')
state_office['sales_ratio'] = 100*(state_office['sales_x']/state_office['sales_y'])
state office_id sales_x sales_y sales_ratio
0 AZ 2 222579 1310725 16.981365
1 AZ 4 252315 1310725 19.250033
2 AZ 6 835831 1310725 63.768601
3 CA 1 405711 2098663 19.331879
4 CA 3 710581 2098663 33.858747
5 CA 5 982371 2098663 46.809373
6 CO 1 404137 1096653 36.851857
7 CO 3 217952 1096653 19.874290
8 CO 5 474564 1096653 43.273852
9 WA 2 535829 1543854 34.707233
10 WA 4 548242 1543854 35.511259
11 WA 6 459783 1543854 29.781508

df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
'office_id': list(range(1, 7)) * 2,
'sales': [np.random.randint(100000, 999999)
for _ in range(12)]})
grouped = df.groupby(['state', 'office_id'])
100*grouped.sum()/df[["state","sales"]].groupby('state').sum()
Returns:
sales
state office_id
AZ 2 54.587910
4 33.009225
6 12.402865
CA 1 32.046582
3 44.937684
5 23.015735
CO 1 21.099989
3 31.848658
5 47.051353
WA 2 43.882790
4 10.265275
6 45.851935

As someone who is also learning pandas I found the other answers a bit implicit as pandas hides most of the work behind the scenes. Namely in how the operation works by automatically matching up column and index names. This code should be equivalent to a step by step version of #exp1orer's accepted answer
With the df, I'll call it by the alias state_office_sales:
sales
state office_id
AZ 2 839507
4 373917
6 347225
CA 1 798585
3 890850
5 454423
CO 1 819975
3 202969
5 614011
WA 2 163942
4 369858
6 959285
state_total_sales is state_office_sales grouped by total sums in index level 0 (leftmost).
In: state_total_sales = df.groupby(level=0).sum()
state_total_sales
Out:
sales
state
AZ 2448009
CA 2832270
CO 1495486
WA 595859
Because the two dataframes share an index-name and a column-name pandas will find the appropriate locations through shared indexes like:
In: state_office_sales / state_total_sales
Out:
sales
state office_id
AZ 2 0.448640
4 0.125865
6 0.425496
CA 1 0.288022
3 0.322169
5 0.389809
CO 1 0.206684
3 0.357891
5 0.435425
WA 2 0.321689
4 0.346325
6 0.331986
To illustrate this even better, here is a partial total with a XX that has no equivalent. Pandas will match the location based on index and column names, where there is no overlap pandas will ignore it:
In: partial_total = pd.DataFrame(
data = {'sales' : [2448009, 595859, 99999]},
index = ['AZ', 'WA', 'XX' ]
)
partial_total.index.name = 'state'
Out:
sales
state
AZ 2448009
WA 595859
XX 99999
In: state_office_sales / partial_total
Out:
sales
state office_id
AZ 2 0.448640
4 0.125865
6 0.425496
CA 1 NaN
3 NaN
5 NaN
CO 1 NaN
3 NaN
5 NaN
WA 2 0.321689
4 0.346325
6 0.331986
This becomes very clear when there are no shared indexes or columns. Here missing_index_totals is equal to state_total_sales except that it has a no index-name.
In: missing_index_totals = state_total_sales.rename_axis("")
missing_index_totals
Out:
sales
AZ 2448009
CA 2832270
CO 1495486
WA 595859
In: state_office_sales / missing_index_totals
Out: ValueError: cannot join with no overlapping index names

df.groupby('state').office_id.value_counts(normalize = True)
I used value_counts method, but it returns percentage like 0.70 and 0.30, not like a 70 and 30.

One-line solution:
df.join(
df.groupby('state').agg(state_total=('sales', 'sum')),
on='state'
).eval('sales / state_total')
This returns a Series of per-office ratios -- can be used on it's own or assigned to the original Dataframe.

how can I transform the following data into pandas dataframe [duplicate]

This question already has answers here:
Unnest (explode) a Pandas Series
(8 answers)
Closed 4 years ago.
I have this type of data I want this each list of each id in seperate column
id data
2 [1.81744912347, 1.96313966807, 1.79290908923]
3 [0.87738744314, 0.154642653196, 0.319845728764]
4 [1.12289279512, 1.16105905267, 1.14889626137]
5 [1.65093687407, 1.65010263863, 1.65614839538]
6 [0.103623262651, 0.46093367049, 0.549343505693]
7 [0.122299243819, 0.355964399805, 0.40010681636]
8 [3.08321032223, 2.92526466342, 2.6504125359, 2]
9 [0.287041436848, 0.264107869667, 0.29319302508]
10 [0.673829091668, 0.632715325748, 0.47099544284]
11 [3.04589375431, 2.19130582148, 1.68173686657]
how can I transform the data into the pandas DataFrame
I want it as the following data
id data
1 1.61567967235
1 1.55256213176
1 1.16904355984
...
10 0.673829091668
10 0.632715325748
and so on
its large amount of data, if I use the loop to transform it, it kills the notebook, is there any other way to process this data,
sample image of the data

IIUC, from
col
0 [1, 2, 3]
1 [4, 5, 6]
can do
df.col.apply(pd.Series).stack().reset_index(drop=True)
0 1
1 2
2 3
3 4
4 5
5 6
dtype: int64
or
pd.Series([z for x in df.col.values for z in x])
0 1
1 2
2 3
3 4
4 5
5 6
dtype: int64
Times:
%timeit df.col.apply(pd.Series).stack().reset_index(drop=True)
1.15 ms ± 26.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit pd.Series([z for x in df.col.values for z in x])
89.2 µs ± 2.58 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Error : 'Series' object has no attribute 'sort' [duplicate]

I face some problem here, in my python package I have install numpy, but I still have this error:
'DataFrame' object has no attribute 'sort'
Anyone can give me some idea..
This is my code :
final.loc[-1] =['', 'P','Actual']
final.index = final.index + 1 # shifting index
final = final.sort()
final.columns=[final.columns,final.iloc[0]]
final = final.iloc[1:].reset_index(drop=True)
final.columns.names = (None, None)

sort() was deprecated for DataFrames in favor of either:
sort_values() to sort by column(s)
sort_index() to sort by the index
sort() was deprecated (but still available) in Pandas with release 0.17 (2015-10-09) with the introduction of sort_values() and sort_index(). It was removed from Pandas with release 0.20 (2017-05-05).

Pandas Sorting 101
sort has been replaced in v0.20 by DataFrame.sort_values and DataFrame.sort_index. Aside from this, we also have argsort.
Here are some common use cases in sorting, and how to solve them using the sorting functions in the current API. First, the setup.
# Setup
np.random.seed(0)
df = pd.DataFrame({'A': list('accab'), 'B': np.random.choice(10, 5)})
df
A B
0 a 7
1 c 9
2 c 3
3 a 5
4 b 2
Sort by Single Column
For example, to sort df by column "A", use sort_values with a single column name:
df.sort_values(by='A')
A B
0 a 7
3 a 5
4 b 2
1 c 9
2 c 3
If you need a fresh RangeIndex, use DataFrame.reset_index.
Sort by Multiple Columns
For example, to sort by both col "A" and "B" in df, you can pass a list to sort_values:
df.sort_values(by=['A', 'B'])
A B
3 a 5
0 a 7
4 b 2
2 c 3
1 c 9
Sort By DataFrame Index
df2 = df.sample(frac=1)
df2
A B
1 c 9
0 a 7
2 c 3
3 a 5
4 b 2
You can do this using sort_index:
df2.sort_index()
A B
0 a 7
1 c 9
2 c 3
3 a 5
4 b 2
df.equals(df2)
# False
df.equals(df2.sort_index())
# True
Here are some comparable methods with their performance:
%timeit df2.sort_index()
%timeit df2.iloc[df2.index.argsort()]
%timeit df2.reindex(np.sort(df2.index))
605 µs ± 13.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
610 µs ± 24.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
581 µs ± 7.63 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Sort by List of Indices
For example,
idx = df2.index.argsort()
idx
# array([0, 7, 2, 3, 9, 4, 5, 6, 8, 1])
This "sorting" problem is actually a simple indexing problem. Just passing integer labels to iloc will do.
df.iloc[idx]
A B
1 c 9
0 a 7
2 c 3
3 a 5
4 b 2

How to convert a column of a dataframe from char to ascii integers? [Pandas]

I have a dataframe in which one columns called 'label' holds values like 'b', 'm', 'n' etc.
I want 'label' to instead hold the ascii equivalent of the letter.
How do I do it?

IIUC:
In [81]:
df = pd.DataFrame({'label':list('bmn')})
df
Out[81]:
label
0 b
1 m
2 n
In [82]:
df['ascii'] = df['label'].apply(ord)
df
Out[82]:
label ascii
0 b 98
1 m 109
2 n 110
It maybe quicker to do a list comprehension:
In [83]:
df['ascii'] = [ord(x) for x in df['label']]
df
Out[83]:
label ascii
0 b 98
1 m 109
2 n 110
You can also use map:
In [85]:
df['ascii'] = df['label'].map(ord)
df
Out[85]:
label ascii
0 b 98
1 m 109
2 n 110
Timings
for a small df:
In [87]:
%timeit [ord(x) for x in df['label']]
%timeit df['label'].map(ord)
%timeit df['label'].apply(ord)
100000 loops, best of 3: 14 µs per loop
10000 loops, best of 3: 123 µs per loop
10000 loops, best of 3: 146 µs per loop
For a 3K df:
In [89]:
%timeit [ord(x) for x in df['label']]
%timeit df['label'].map(ord)
%timeit df['label'].apply(ord)
1000 loops, best of 3: 246 µs per loop
1000 loops, best of 3: 1 ms per loop
1000 loops, best of 3: 1.02 ms per loop
So here the list comprehension scales better than the other methods

e.g. "a"=97 in ascii}
write print(ord("a"))
print(ord("a"))
answer would be 97

Pandas Set Top Row as MultiIndex Level 1

Given the following data frame:
d2=pd.DataFrame({'Item':['items','y','z','x'],
'other':['others','bb','cc','dd']})
d2
Item other
0 items others
1 y bb
2 z cc
3 x dd
I'd like to create a multiindexed set of headers such that the current headers become level 0 and the current top row becomes level 1.
Thanks in advance!

Another solution is create MultiIndex.from_tuples:
cols = list(zip(d2.columns, d2.iloc[0,:]))
c1 = pd.MultiIndex.from_tuples(cols, names=[None, 0])
print (pd.DataFrame(data=d2[1:].values, columns=c1, index=d2.index[1:]))
Item other
0 items others
1 y bb
2 z cc
3 x dd
Or if column names are not important:
cols = list(zip(d2.columns, d2.iloc[0,:]))
d2.columns = pd.MultiIndex.from_tuples(cols)
print (d2[1:])
Item other
items others
1 y bb
2 z cc
3 x dd
Timings:
len(df)=400k:
In [63]: %timeit jez(d22)
100 loops, best of 3: 6.22 ms per loop
In [64]: %timeit piR(d2)
10 loops, best of 3: 84.9 ms per loop
len(df)=40:
In [70]: %timeit jez(d22)
The slowest run took 4.61 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 941 µs per loop
In [71]: %timeit piR(d2)
The slowest run took 4.44 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 1.36 ms per loop
Code:
import pandas as pd
d2=pd.DataFrame({'Item':['items','y','z','x'],
'other':['others','bb','cc','dd']})
print (d2)
d2 = pd.concat([d2]*100000).reset_index(drop=True)
#d2 = pd.concat([d2]*10).reset_index(drop=True)
d22 = d2.copy()
def piR(d2):
return (d2.T.set_index(0, append=1).T)
def jez(d2):
cols = list(zip(d2.columns, d2.iloc[0,:]))
c1 = pd.MultiIndex.from_tuples(cols, names=[None, 0])
return pd.DataFrame(data=d2[1:].values, columns=c1, index=d2.index[1:])
print (piR(d2))
print (jez(d22))
print ((piR(d2) == jez(d22)).all())
Item items True
other others True
dtype: bool

Transpose the DataFrame, set_index with the first column with parameter append = True, then Transpose back.
d2.T.set_index(0, append=1).T

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Filter the DataFrame rows to only show movies with a 'duration' of at least 200 minutes [duplicate] - python-3.x

How can I select rows from a DataFrame based on values in some column in Pandas? In SQL, I would use: SELECT * FROM table WHERE column_name = some_value

Related

Sub-Totals & Total In Pandas [duplicate]

how can I transform the following data into pandas dataframe [duplicate]

Error : 'Series' object has no attribute 'sort' [duplicate]

How to convert a column of a dataframe from char to ascii integers? [Pandas]

Pandas Set Top Row as MultiIndex Level 1

Categories

Resources