Pandas Calculate CAGR with Slicing (missing values) - python-3.x

As a follow-up to this question,
I'd like to calculate the CAGR from a pandas data frame such as this, where there are some missing data values:
df = pd.DataFrame({'A': ['1', '2', '3', '7'],
                   'B': [7, 6, np.nan, 4],
                   'C': [5, 6, 7, 1],
                   'D': [np.nan, 9, 9, 8]})
df = df.set_index('A')
df
B C D
A
1 7 5 NaN
2 6 6 9
3 NaN 7 9
7 4 1 8
Thanks in advance!

When calculating returns from a level, it's OK to use the most recent available value. For example, when calculating CAGR for row 1, we want to use (5/7) ^ (1/3) - 1. Also, for row 3, (9/7) ^ (1/3) - 1. There is an assumption made that we annualize across all years looked at.
With these assumptions:
df = df.bfill(axis=1).ffill(axis=1)
Then apply solution from linked question.
df['CAGR'] = df.T.pct_change().add(1).prod().pow(1./(len(df.columns) - 1)).sub(1)
Without this assumption, the only other reasonable choice would be to annualize by the number of non-NaN observations. So I need to track that with:
notnull = df.notnull().sum(axis=1)
df = df.bfill(axis=1).ffill(axis=1)
df['CAGR'] = df.T.pct_change().add(1).prod().pow(1./(notnull.sub(1))).sub(1)
In fact, this becomes the more general solution, as it will work for the case without nulls as well.
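For reference, here is a minimal end-to-end sketch of that general solution on the sample frame (my own assembly of the snippets above, annualizing by the number of observed periods):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['1', '2', '3', '7'],
                   'B': [7, 6, np.nan, 4],
                   'C': [5, 6, 7, 1],
                   'D': [np.nan, 9, 9, 8]}).set_index('A')

# Count real observations per row before filling the gaps.
notnull = df.notnull().sum(axis=1)
filled = df.bfill(axis=1).ffill(axis=1)

# Chain period-over-period growth factors, then annualize by the
# number of observed periods (observations minus one).
df['CAGR'] = filled.T.pct_change().add(1).prod().pow(1. / notnull.sub(1)).sub(1)
print(df['CAGR'])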

What are the most common pandas ways to select/filter rows of a dataframe whose index is a MultiIndex?
Slicing based on a single value/label
Slicing based on multiple labels from one or more levels
Filtering on boolean conditions and expressions
Which methods are applicable in what circumstances
Assumptions for simplicity:
input dataframe does not have duplicate index keys
input dataframe below only has two levels. (Most solutions shown here generalize to N levels)
Example input:
import numpy as np
import pandas as pd

mux = pd.MultiIndex.from_arrays([
    list('aaaabbbbbccddddd'),
    list('tuvwtuvwtuvwtuvw')
], names=['one', 'two'])
df = pd.DataFrame({'col': np.arange(len(mux))}, mux)
col
one two
a t 0
u 1
v 2
w 3
b t 4
u 5
v 6
w 7
t 8
c u 9
v 10
d w 11
t 12
u 13
v 14
w 15
Question 1: Selecting a Single Item
How do I select rows having "a" in level "one"?
col
one two
a t 0
u 1
v 2
w 3
Additionally, how would I be able to drop level "one" in the output?
col
two
t 0
u 1
v 2
w 3
Question 1b
How do I slice all rows with value "t" on level "two"?
col
one two
a t 0
b t 4
t 8
d t 12
Question 2: Selecting Multiple Values in a Level
How can I select rows corresponding to items "b" and "d" in level "one"?
col
one two
b t 4
u 5
v 6
w 7
t 8
d w 11
t 12
u 13
v 14
w 15
Question 2b
How would I get all values corresponding to "t" and "w" in level "two"?
col
one two
a t 0
w 3
b t 4
w 7
t 8
d w 11
t 12
w 15
Question 3: Slicing a Single Cross Section (x, y)
How do I retrieve a cross section, i.e., a single row having specific values for each level of the index of df? Specifically, how do I retrieve the cross section of ('c', 'u'), given by
col
one two
c u 9
Question 4: Slicing Multiple Cross Sections [(a, b), (c, d), ...]
How do I select the two rows corresponding to ('c', 'u'), and ('a', 'w')?
col
one two
c u 9
a w 3
Question 5: One Item Sliced per Level
How can I retrieve all rows corresponding to "a" in level "one" or "t" in level "two"?
col
one two
a t 0
u 1
v 2
w 3
b t 4
t 8
d t 12
Question 6: Arbitrary Slicing
How can I slice specific cross sections? For "a" and "b", I would like to select all rows with sub-levels "u" and "v", and for "d", I would like to select rows with sub-level "w".
col
one two
a u 1
v 2
b u 5
v 6
d w 11
w 15
Question 7 will use a unique setup consisting of a numeric level:
np.random.seed(0)
mux2 = pd.MultiIndex.from_arrays([
    list('aaaabbbbbccddddd'),
    np.random.choice(10, size=16)
], names=['one', 'two'])
df2 = pd.DataFrame({'col': np.arange(len(mux2))}, mux2)
col
one two
a 5 0
0 1
3 2
3 3
b 7 4
9 5
3 6
5 7
2 8
c 4 9
7 10
d 6 11
8 12
8 13
1 14
6 15
Question 7: Filtering by numeric inequality on individual levels of the multiindex
How do I get all rows where values in level "two" are greater than 5?
col
one two
b 7 4
9 5
c 7 10
d 6 11
8 12
8 13
6 15
Note: This post will not go through how to create MultiIndexes, how to perform assignment operations on them, or any performance related discussions (these are separate topics for another time).
MultiIndex / Advanced Indexing
Note
This post will be structured in the following manner:
The questions put forth in the OP will be addressed, one by one
For each question, one or more methods applicable to solving this problem and getting the expected result will be demonstrated.
Notes (much like this one) will be included for readers interested in learning about additional functionality, implementation details,
and other info cursory to the topic at hand. These notes have been
compiled through scouring the docs and uncovering various obscure
features, and from my own (admittedly limited) experience.
All code samples have been created and tested on pandas v0.23.4, Python 3.7. If something is not clear, or factually incorrect, or if you did not find a solution applicable to your use case, please feel free to suggest an edit, request clarification in the comments, or open a new question, as applicable.
Here is an introduction to some common idioms (henceforth referred to as the Four Idioms) we will be frequently revisiting:
DataFrame.loc - A general solution for selection by label (+ pd.IndexSlice for more complex applications involving slices)
DataFrame.xs - Extract a particular cross section from a Series/DataFrame.
DataFrame.query - Specify slicing and/or filtering operations dynamically (i.e., as an expression that is evaluated dynamically). This is more applicable to some scenarios than others. Also see this section of the docs for querying on MultiIndexes.
Boolean indexing with a mask generated using MultiIndex.get_level_values (often in conjunction with Index.isin, especially when filtering with multiple values). This is also quite useful in some circumstances.
It will be beneficial to look at the various slicing and filtering problems in terms of the Four Idioms to gain a better understanding of what can be applied to a given situation. It is very important to understand that not all of the idioms will work equally well (if at all) in every circumstance. If an idiom has not been listed as a potential solution to a problem below, that means that idiom cannot be applied to that problem effectively.
Question 1
How do I select rows having "a" in level "one"?
col
one two
a t 0
u 1
v 2
w 3
You can use loc, as a general purpose solution applicable to most situations:
df.loc[['a']]
At this point, if you get
TypeError: Expected tuple, got str
That means you're using an older version of pandas. Consider upgrading! Otherwise, use df.loc[('a', slice(None)), :].
Alternatively, you can use xs here, since we are extracting a single cross section. Note the levels and axis arguments (reasonable defaults can be assumed here).
df.xs('a', level=0, axis=0, drop_level=False)
# df.xs('a', drop_level=False)
Here, the drop_level=False argument is needed to prevent xs from dropping level "one" in the result (the level we sliced on).
Yet another option here is using query:
df.query("one == 'a'")
If the index did not have a name, you would need to change your query string to be "ilevel_0 == 'a'".
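As a quick sketch of that unnamed-level fallback (the renamed copy below is purely illustrative):
df_unnamed = df.copy()
df_unnamed.index = df_unnamed.index.set_names([None, None])
df_unnamed.query("ilevel_0 == 'a'")  # ilevel_0 = "index level 0"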
Finally, using get_level_values:
df[df.index.get_level_values('one') == 'a']
# If your levels are unnamed, or if you need to select by position (not label),
# df[df.index.get_level_values(0) == 'a']
Additionally, how would I be able to drop level "one" in the output?
col
two
t 0
u 1
v 2
w 3
This can be easily done using either
df.loc['a'] # Notice the single string argument instead of the list.
Or,
df.xs('a', level=0, axis=0, drop_level=True)
# df.xs('a')
Notice that we can omit the drop_level argument (it is assumed to be True by default).
Note
You may notice that a filtered DataFrame may still have all the levels, even if they do not show when printing the DataFrame out. For example,
v = df.loc[['a']]
print(v)
col
one two
a t 0
u 1
v 2
w 3
print(v.index)
MultiIndex(levels=[['a', 'b', 'c', 'd'], ['t', 'u', 'v', 'w']],
labels=[[0, 0, 0, 0], [0, 1, 2, 3]],
names=['one', 'two'])
You can get rid of these levels using MultiIndex.remove_unused_levels:
v.index = v.index.remove_unused_levels()
print(v.index)
MultiIndex(levels=[['a'], ['t', 'u', 'v', 'w']],
labels=[[0, 0, 0, 0], [0, 1, 2, 3]],
names=['one', 'two'])
Question 1b
How do I slice all rows with value "t" on level "two"?
col
one two
a t 0
b t 4
t 8
d t 12
Intuitively, you would want something involving slice():
df.loc[(slice(None), 't'), :]
It Just Works!™ But it is clunky. We can facilitate a more natural slicing syntax using the pd.IndexSlice API here.
idx = pd.IndexSlice
df.loc[idx[:, 't'], :]
This is much, much cleaner.
Note
Why is the trailing slice : across the columns required? This is because loc can be used to select and slice along both axes (axis=0 or axis=1). Without explicitly making it clear which axis the slicing is to be done on, the operation becomes ambiguous. See the big red box in the documentation on slicing.
If you want to remove any shade of ambiguity, loc accepts an axis
parameter:
df.loc(axis=0)[pd.IndexSlice[:, 't']]
Without the axis parameter (i.e., just by doing df.loc[pd.IndexSlice[:, 't']]), slicing is assumed to be on the columns,
and a KeyError will be raised in this circumstance.
This is documented in slicers. For the purpose of this post, however, we will explicitly specify all axes.
With xs, it is
df.xs('t', axis=0, level=1, drop_level=False)
With query, it is
df.query("two == 't'")
# Or, if the second level has no name,
# df.query("ilevel_1 == 't'")
And finally, with get_level_values, you may do
df[df.index.get_level_values('two') == 't']
# Or, to perform selection by position/integer,
# df[df.index.get_level_values(1) == 't']
All to the same effect.
Question 2
How can I select rows corresponding to items "b" and "d" in level "one"?
col
one two
b t 4
u 5
v 6
w 7
t 8
d w 11
t 12
u 13
v 14
w 15
Using loc, this is done in a similar fashion by specifying a list.
df.loc[['b', 'd']]
To solve the above problem of selecting "b" and "d", you can also use query:
items = ['b', 'd']
df.query("one in #items")
# df.query("one == #items", parser='pandas')
# df.query("one in ['b', 'd']")
# df.query("one == ['b', 'd']", parser='pandas')
Note
Yes, the default parser is 'pandas', but it is important to highlight that this syntax isn't conventional Python. The
Pandas parser generates a slightly different parse tree from the
expression. This is done to make some operations more intuitive to
specify. For more information, please read my post on
Dynamic Expression Evaluation in pandas using pd.eval().
And, with get_level_values + Index.isin:
df[df.index.get_level_values("one").isin(['b', 'd'])]
Question 2b
How would I get all values corresponding to "t" and "w" in level "two"?
col
one two
a t 0
w 3
b t 4
w 7
t 8
d w 11
t 12
w 15
With loc, this is possible only in conjunction with pd.IndexSlice.
df.loc[pd.IndexSlice[:, ['t', 'w']], :]
The first colon : in pd.IndexSlice[:, ['t', 'w']] means to slice across the first level. As the depth of the level being queried increases, you will need to specify more slices, one per level being sliced across. You will not need to specify more levels beyond the one being sliced, however.
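For instance, a minimal sketch with a hypothetical three-level index, showing that one placeholder colon is needed per level above the one being sliced:
# Hypothetical 3-level index, built only to illustrate the extra colons.
mux_deep = pd.MultiIndex.from_product([list('ab'), list('xy'), list('tu')],
                                      names=['one', 'two', 'three'])
d3 = pd.DataFrame({'col': np.arange(len(mux_deep))}, index=mux_deep)
d3.loc[pd.IndexSlice[:, :, 't'], :]  # two colons to reach level "three"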
With query, this is
items = ['t', 'w']
df.query("two in #items")
# df.query("two == #items", parser='pandas')
# df.query("two in ['t', 'w']")
# df.query("two == ['t', 'w']", parser='pandas')
With get_level_values and Index.isin (similar to above):
df[df.index.get_level_values('two').isin(['t', 'w'])]
Question 3
How do I retrieve a cross section, i.e., a single row having specific values for each level of the index of df? Specifically, how do I retrieve the cross section of ('c', 'u'), given by
col
one two
c u 9
Use loc by specifying a tuple of keys:
df.loc[('c', 'u'), :]
Or,
df.loc[pd.IndexSlice[('c', 'u')]]
Note
At this point, you may run into a PerformanceWarning that looks like this:
PerformanceWarning: indexing past lexsort depth may impact performance.
This just means that your index is not sorted. pandas depends on the index being sorted (in this case, lexicographically, since we are dealing with string values) for optimal search and retrieval. A quick fix would be to sort your
DataFrame in advance using DataFrame.sort_index. This is especially desirable from a performance standpoint if you plan on doing
multiple such queries in tandem:
df_sort = df.sort_index()
df_sort.loc[('c', 'u')]
You can also use MultiIndex.is_lexsorted() to check whether the index
is sorted or not. This function returns True or False accordingly.
You can call this function to determine whether an additional sorting
step is required or not.
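A quick sketch of that check on our toy frame:
df.index.is_lexsorted()               # False: within "b", t comes after w
df.sort_index().index.is_lexsorted()  # True after sorting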
With xs, this is again simply passing a single tuple as the first argument, with all other arguments set to their appropriate defaults:
df.xs(('c', 'u'))
With query, things become a bit clunky:
df.query("one == 'c' and two == 'u'")
You can see now that this is going to be relatively difficult to generalize. But is still OK for this particular problem.
With accesses spanning multiple levels, get_level_values can still be used, but is not recommended:
m1 = (df.index.get_level_values('one') == 'c')
m2 = (df.index.get_level_values('two') == 'u')
df[m1 & m2]
Question 4
How do I select the two rows corresponding to ('c', 'u'), and ('a', 'w')?
col
one two
c u 9
a w 3
With loc, this is still as simple as:
df.loc[[('c', 'u'), ('a', 'w')]]
# df.loc[pd.IndexSlice[[('c', 'u'), ('a', 'w')]]]
With query, you will need to dynamically generate a query string by iterating over your cross sections and levels:
cses = [('c', 'u'), ('a', 'w')]
levels = ['one', 'two']
# This is a useful check to make in advance.
assert all(len(levels) == len(cs) for cs in cses)
query = '(' + ') or ('.join([
' and '.join([f"({l} == {repr(c)})" for l, c in zip(levels, cs)])
for cs in cses
]) + ')'
print(query)
# ((one == 'c') and (two == 'u')) or ((one == 'a') and (two == 'w'))
df.query(query)
100% DO NOT RECOMMEND! But it is possible.
What if I have multiple levels?
One option in this scenario would be to use droplevel to drop the levels you're not checking, then use isin to test membership, and then boolean index on the final result.
df[df.index.droplevel(unused_level).isin([('c', 'u'), ('a', 'w')])]
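As a sketch, suppose the frame had a hypothetical third level named "three" (added below purely for illustration) that we are not filtering on:
# Tack on a dummy level 'three', then drop it before the membership test.
df4 = pd.concat({'z': df}, names=['three']).reorder_levels(['one', 'two', 'three'])
mask = df4.index.droplevel('three').isin([('c', 'u'), ('a', 'w')])
df4[mask]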
Question 5
How can I retrieve all rows corresponding to "a" in level "one" or
"t" in level "two"?
col
one two
a t 0
u 1
v 2
w 3
b t 4
t 8
d t 12
This is actually very difficult to do with loc while ensuring correctness and still maintaining code clarity. df.loc[pd.IndexSlice['a', 't']] is incorrect; it is interpreted as df.loc[pd.IndexSlice[('a', 't')]] (i.e., selecting a cross section). You may think of a solution with pd.concat to handle each label separately:
pd.concat([
df.loc[['a'],:], df.loc[pd.IndexSlice[:, 't'],:]
])
col
one two
a t 0
u 1
v 2
w 3
t 0 # Does this look right to you? No, it isn't!
b t 4
t 8
d t 12
But you'll notice one of the rows is duplicated. This is because that row satisfied both slicing conditions, and so appeared twice. You will instead need to do
v = pd.concat([
df.loc[['a'],:], df.loc[pd.IndexSlice[:, 't'],:]
])
v[~v.index.duplicated()]
But if your DataFrame inherently contains duplicate indices (that you want), then this will not retain them. Use with extreme caution.
With query, this is stupidly simple:
df.query("one == 'a' or two == 't'")
With get_level_values, this is still simple, but not as elegant:
m1 = (df.index.get_level_values('one') == 'a')
m2 = (df.index.get_level_values('two') == 't')
df[m1 | m2]
Question 6
How can I slice specific cross sections? For "a" and "b", I would like to select all rows with sub-levels "u" and "v", and
for "d", I would like to select rows with sub-level "w".
col
one two
a u 1
v 2
b u 5
v 6
d w 11
w 15
This is a special case that I've added to help understand the applicability of the Four Idioms—this is one case where none of them will work effectively, since the slicing is very specific, and does not follow any real pattern.
Usually, slicing problems like this will require explicitly passing a list of keys to loc. One way of doing this is with:
keys = [('a', 'u'), ('a', 'v'), ('b', 'u'), ('b', 'v'), ('d', 'w')]
df.loc[keys, :]
If you want to save some typing, you will recognise that there is a pattern to slicing "a", "b" and their sub-levels, so we can separate the slicing task into two portions and concat the result:
pd.concat([
df.loc[(('a', 'b'), ('u', 'v')), :],
df.loc[('d', 'w'), :]
], axis=0)
The slicing specification for "a" and "b" is slightly cleaner, (('a', 'b'), ('u', 'v')), because the sub-levels being indexed are the same for each label.
Question 7
How do I get all rows where values in level "two" are greater than 5?
col
one two
b 7 4
9 5
c 7 10
d 6 11
8 12
8 13
6 15
This can be done using query,
df2.query("two > 5")
And get_level_values.
df2[df2.index.get_level_values('two') > 5]
Note
Similar to this example, we can filter based on any arbitrary condition using these constructs. In general, it is useful to remember that loc and xs are specifically for label-based indexing, while query and
get_level_values are helpful for building general conditional masks
for filtering.
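As a quick sketch of that generality, masks built from get_level_values can be combined with any boolean condition (an illustrative example, not from the original questions):
# e.g., rows of df2 where level "two" is even AND level "one" is not 'c'
lv = df2.index.get_level_values
df2[(lv('two') % 2 == 0) & (lv('one') != 'c')]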
Bonus Question
What if I need to slice a MultiIndex column?
Actually, most solutions here are applicable to columns as well, with minor changes. Consider:
np.random.seed(0)
mux3 = pd.MultiIndex.from_product([
    list('ABCD'), list('efgh')
], names=['one', 'two'])
df3 = pd.DataFrame(np.random.choice(10, (3, len(mux3))), columns=mux3)
print(df3)
one A B C D
two e f g h e f g h e f g h e f g h
0 5 0 3 3 7 9 3 5 2 4 7 6 8 8 1 6
1 7 7 8 1 5 9 8 9 4 3 0 3 5 0 2 3
2 8 1 3 3 3 7 0 1 9 9 0 4 7 3 2 7
These are the changes you will need to make to the Four Idioms to have them work with columns.
To slice with loc, use
df3.loc[:, ....] # Notice how we slice across the index with `:`.
or,
df3.loc[:, pd.IndexSlice[...]]
To use xs as appropriate, just pass an argument axis=1.
You can access the column level values directly using df.columns.get_level_values. You will then need to do something like
df.loc[:, {condition}]
Where {condition} represents some condition built using columns.get_level_values.
To use query, your only option is to transpose, query on the index, and transpose again:
df3.T.query(...).T
Not recommended, use one of the other 3 options.
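Here is a minimal sketch applying the first and last of those changes to df3:
# Select all columns under "B" and "D" on level "one",
# keeping only sub-column "f" on level "two".
df3.loc[:, pd.IndexSlice[['B', 'D'], 'f']]

# The equivalent mask built from the column level values:
mask = (df3.columns.get_level_values('one').isin(['B', 'D'])
        & (df3.columns.get_level_values('two') == 'f'))
df3.loc[:, mask]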
Recently I came across a use case where I had a 3+ level MultiIndex dataframe in which I couldn't make any of the solutions above produce the results I was looking for. It's quite possible that they do work for my use case; I tried several, but was unable to get them to work in the time I had available.
I am far from an expert, but I stumbled across a solution that was not listed in the comprehensive answers above. I offer no guarantee that it is in any way optimal.
This is a different way to get a slightly different result to Question 6 above (and likely other questions as well).
Specifically I was looking for:
A way to choose two+ values from one level of the index and a single value from another level of the index, and
A way to leave the index values from the previous operation in the dataframe output.
As a monkey wrench in the gears (though totally fixable):
The indexes were unnamed.
On the toy dataframe below:
index = pd.MultiIndex.from_product([['a', 'b'],
                                    ['stock1', 'stock2', 'stock3'],
                                    ['price', 'volume', 'velocity']])
df = pd.DataFrame([1, 2, 3, 4, 5, 6, 7, 8, 9,
                   10, 11, 12, 13, 14, 15, 16, 17, 18],
                  index)
0
a stock1 price 1
volume 2
velocity 3
stock2 price 4
volume 5
velocity 6
stock3 price 7
volume 8
velocity 9
b stock1 price 10
volume 11
velocity 12
stock2 price 13
volume 14
velocity 15
stock3 price 16
volume 17
velocity 18
Using the below works, of course:
df.xs(('stock1', 'velocity'), level=(1,2))
0
a 3
b 12
But I wanted a different result, so my method to get that result was:
df.iloc[df.index.isin(['stock1'], level=1) &
df.index.isin(['velocity'], level=2)]
0
a stock1 velocity 3
b stock1 velocity 12
And if I wanted two+ values from one level and a single (or 2+) value from another level:
df.iloc[df.index.isin(['stock1','stock3'], level=1) &
df.index.isin(['velocity'], level=2)]
0
a stock1 velocity 3
stock3 velocity 9
b stock1 velocity 12
stock3 velocity 18
The above method is probably a bit clunky; however, I found it filled my needs, and as a bonus it was easier for me to understand and read.
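For comparison, the single-value case above can also be written with xs and drop_level=False, which likewise keeps the sliced levels in the output (though xs cannot take two+ values per level):
df.xs(('stock1', 'velocity'), level=(1, 2), drop_level=False)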
This looks like a great case for dfsql
df.sql(<SQL select statement>)
https://github.com/mindsdb/dfsql
A complete article about it here:
https://medium.com/riselab/why-every-data-scientist-using-pandas-needs-modin-bringing-sql-to-dataframes-3b216b29a7c0
I have long used and appreciated this question, and @cs95's response, which is very thorough and handles all instances. Similar to @r-a's answer, I too wanted to find a way to work with multiple indices that contained multiple levels.
I finally found a way to obtain an arbitrary number of slices given a level or a named index, which is able to handle several of the questions proposed above. The major improvement here is not having to parse out slice(None) or the : with pd.IndexSlice for multiple indexes, or slices.
import numpy as np
import pandas as pd

def slice_df_by(df_, slice_by=["Oman", "Nairobi"], slice_idx='country'):
    idxn = df_.index.names.index(slice_idx)
    return df_.loc[tuple([slice(None)] * idxn + [slice_by]), :]
gender = tuple(["male", "female"] * 6)
thrown = tuple(["rock", "scissors", "paper"] * 4)
country = tuple(["Nairobi", "Oman", "Djibouti", "Belize"] * 3)
names = tuple(["Chris", "Pat", "Michele", "Thomy", "Musa", "Casey"] * 2)
tuples = list(zip(gender, thrown, country, names))
idx = pd.MultiIndex.from_tuples(tuples,
                                names=["gender", "thrown", "country", "name"])
df = pd.DataFrame({'Count A': [12., 70., 30., 20.] * 3,
                   'Count B': [12., 70., 30., 20.] * 3}, index=idx)
The benefit here is that you can add any combination of these calls to the function slice_df_by to get more complicated slices while only using the index name and a list of values.
print(slice_df_by(df))
Count A Count B
gender thrown country name
female scissors Oman Pat 70.0 70.0
paper Oman Casey 70.0 70.0
rock Oman Thomy 70.0 70.0
male rock Nairobi Chris 12.0 12.0
scissors Nairobi Musa 12.0 12.0
paper Nairobi Michele 12.0 12.0
The catch, as @r-a pointed out, is not having named indices. There are plenty of ways to satisfy this using the approach here, such as df.index.names = ["names", "for", "the", "indices"] or some such method:
idxz = lambda ixln=4: [chr(i) for i in np.arange(ixln)+65]
df.index.names = idxz(len(df.index.names))
print(idxz())
Out[132]: ['A', 'B', 'C', 'D']
One option is with select_rows from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
Question 1
How do I select rows having "a" in level "one"?
df.select_rows('a')
col
one two
a t 0
u 1
v 2
w 3
Additionally, how would I be able to drop level "one" in the output?
df.select_rows('a').droplevel('one')
col
two
t 0
u 1
v 2
w 3
Question 1b
How do I slice all rows with value "t" on level "two"?
col
one two
a t 0
b t 4
t 8
d t 12
Use a dictionary here, specify the level as a key, and pass the labels to select:
df.select_rows({'two':'t'})
col
one two
a t 0
b t 4
t 8
d t 12
Question 2
How can I select rows corresponding to items "b" and "d" in level "one"?
col
one two
b t 4
u 5
v 6
w 7
t 8
d w 11
t 12
u 13
v 14
w 15
Since selection is on a single level, pass a list of the labels:
df.select_rows(['b','d'])
col
one two
b t 4
u 5
v 6
w 7
t 8
d w 11
t 12
u 13
v 14
w 15
Question 2b
How would I get all values corresponding to "t" and "w" in level "two"?
col
one two
a t 0
w 3
b t 4
w 7
t 8
d w 11
t 12
w 15
Use a dictionary:
df.select_rows({'two':['t','w']})
col
one two
a t 0
b t 4
t 8
d t 12
a w 3
b w 7
d w 11
w 15
Question 3
How do I retrieve a cross section, i.e., a single row having a specific values
for the index from df? Specifically, how do I retrieve the cross
section of ('c', 'u'), given by
col
one two
c u 9
We are going across levels (horizontally, not vertically), so a tuple is required:
# sort required to avoid lexsort performance warning
df.sort_index().select_rows(('c','u'))
col
one two
c u 9
Question 4
How do I select the two rows corresponding to ('c', 'u'), and ('a', 'w')?
col
one two
c u 9
a w 3
select_rows accepts multiple variable arguments:
df.sort_index().select_rows(('c','u'), ('a','w'))
col
one two
c u 9
a w 3
Question 5
How can I retrieve all rows corresponding to "a" in level "one" or
"t" in level "two"?
col
one two
a t 0
u 1
v 2
w 3
b t 4
t 8
d t 12
df.select_rows('a', {'two':'t'})
col
one two
a t 0
u 1
v 2
w 3
t 0
b t 4
t 8
d t 12
Question 6
How can I slice specific cross sections? For "a" and "b", I would like to select all rows with sub-levels "u" and "v", and
for "d", I would like to select rows with sub-level "w".
col
one two
a u 1
v 2
b u 5
v 6
d w 11
w 15
df.sort_index().select_rows({'one':['a','b'], 'two':['u','v']}, ('d','w'))
col
one two
a u 1
v 2
b u 5
v 6
d w 11
w 15
Question 7
How do I get all rows where values in level "two" are greater than 5?
col
one two
b 7 4
9 5
c 7 10
d 6 11
8 12
8 13
6 15
With a dictionary, you can pass a function, as long as it can be evaluated on an Index object:
df2.select_rows({'two': lambda df: df > 5})
col
one two
b 7 4
9 5
c 7 10
d 6 11
8 12
8 13
6 15
You can select on columns with the select_columns function. There is also a generic select function for selecting on both rows and columns.
The functions are extensible: let's see how it works with @double0darbo's answer:
df.select_rows({'country':['Oman', 'Nairobi']})
Count A Count B
gender thrown country name
female scissors Oman Pat 70.0 70.0
paper Oman Casey 70.0 70.0
rock Oman Thomy 70.0 70.0
male rock Nairobi Chris 12.0 12.0
scissors Nairobi Musa 12.0 12.0
paper Nairobi Michele 12.0 12.0
Attempting @r-a's answer as well:
df.select_rows({1:'stock1', 2:'velocity'})
0
a stock1 velocity 3
b stock1 velocity 12
df.select_rows({1:['stock1','stock3'], 2:'velocity'})
0
a stock1 velocity 3
b stock1 velocity 12
a stock3 velocity 9
b stock3 velocity 18
df.select_rows({0:slice('a',None), 1:['stock1','stock3'], 2:'velocity'})
0
a stock1 velocity 3
stock3 velocity 9
b stock1 velocity 12
stock3 velocity 18

python3.7 & pandas - use column value in row as lookup value to return different column value

I've got a tricky situation - tricky for me since I'm really new to Python. I've got a dataframe in pandas and I need to logic my way through building a new column that will be used later in a data match from a different source. Basically, the picture tells what I can't figure out.
For any of the LOW labels I need to retrieve their MID_LEVEL label and copy it to a new column. The DESIRED OUTPUT column is what I need to create.
You can see that the LABEL_PATH is formatted in a way that I can use the first 9 digits as a "lookup" to find the corresponding LABEL, but I can't figure out how to achieve that. As an example, for any row that the LABEL_PATH starts with "0.02.0004" the desired output needs to be "MID_LEVEL1".
This dataset has around 25k rows, so wanted to avoid row iteration as well.
Any help would be greatly appreciated!
Choosing a similar example to yours:
df = pd.DataFrame({"a":["1","1.1","1.1.1","1.1.2","2"],"b":range(5)})
df["c"] = np.nan
mask = df.a.apply(lambda x: len(x.split(".")) < 3)
df.loc[mask,"c"] = df.b[mask]
df.c.fillna(method="ffill", inplace=True)
Most of the magic takes place in the line where mask is defined, but it's not that difficult: if the value in a gets split into fewer than 3 parts (i.e., has at most one dot), mark it as True, otherwise not.
Use that mask to copy over the values, and then fill unspecified values with valid values from above.
I am using this data for comparison:
test_dict = {"label_path": [1, 2, 3, 4, 5, 6], "label": ["low1", "low2", "mid1", "mid2", "high1", "high2"], "desired_output": ["mid1", "mid2", "mid1", "mid2", "high1", "high2"]}
df = pd.DataFrame(test_dict)
Which gives :
label_path label desired_output
0 1 low1 mid1
1 2 low2 mid2
2 3 mid1 mid1
3 4 mid2 mid2
4 5 high1 high1
5 6 high2 high2
With a bit of logic and a merge:
desired_label_df = df.drop_duplicates("desired_output", keep="last")
desired_label_df = desired_label_df[["label_path", "desired_output"]]
desired_label_df.columns = ["desired_label_path", "desired_output"]
df = df.merge(desired_label_df, on="desired_output", how="left")
Gives us :
label_path label desired_output desired_label_path
0 1 low1 mid1 3
1 2 low2 mid2 4
2 3 mid1 mid1 3
3 4 mid2 mid2 4
4 5 high1 high1 5
5 6 high2 high2 6
Edit: if you want to create the desired_output column, just do the following:
df["desired_output"] = df["label"].apply(lambda x: x.replace("low", "mid"))

How to organise different datasets on Excel into the same layout/order (using pandas)

I have multiple Excel spreadsheets containing the same types of data, but they are not in the same order. For example, file 1 has the results of measurements A, B, C and D from River X in columns 1, 2, 3 and 4, respectively, while file 2 has the same measurements taken for a different river, River Y, in columns 6, 7, 8 and 9. Is there a way to use pandas to reorganise one dataframe to match the layout of another (i.e., make it so that Sheet2 has the measurements for River Y in columns 1, 2, 3 and 4)? Sometimes the data is presented horizontally, not vertically as described above, too. If I have the same measurements for, say, 400 different rivers on 400 separate sheets, but the presentation/layout is erratic from file to file, it would be useful to be able to impose a single order on every spreadsheet without having to manually shift columns in Excel.
Is there a way to use pandas to reorganise one dataframe to match the layout of another dataframe?
You can get a list of columns from one of your dataframes and then sort that. Next you can use the sorted order to reorder your remaining dataframes. I've created an example below:
import pandas as pd
import numpy as np
# Create an example of your problem
root = 'River'
suffix = list('123')
cols_1 = [root + '_' + each_suffix for each_suffix in suffix]
cols_2 = [root + '_' + each_suffix for each_suffix in suffix[::-1]]  # reversed order
data = np.arange(9).reshape(3, 3)
df_1 = pd.DataFrame(columns=cols_1, data=data)
df_2 = pd.DataFrame(columns=cols_2, data=data)
df_1
[out] River_1 River_2 River_3
0 0 1 2
1 3 4 5
2 6 7 8
df_2
[out] River_3 River_2 River_1
0 0 1 2
1 3 4 5
2 6 7 8
col_list = df_1.columns.to_list()  # Get a list of column names; use .sort() to sort in place, or:
sorted_col_list = sorted(col_list, reverse=False)  # Use reverse=True to invert the order

def rearrange_df_cols(df, target_order):
    df = df[target_order]
    print(df)
    return df
rearrange_df_cols(df_1, sorted_col_list)
[out] River_1 River_2 River_3
0 0 1 2
1 3 4 5
2 6 7 8
rearrange_df_cols(df_2, sorted_col_list)
[out] River_1 River_2 River_3
0 2 1 0
1 5 4 3
2 8 7 6
You can write a function based on what's above and apply it to all of your files/sheets, provided that all column names exist (NB: they must be written identically).
Sometimes the data is presented horizontally, not vertically as described above, too.
This would be better as a separate question. In principle, you should check the dimensions of your data, e.g. df.shape, and based on the shape either use df.transpose() and then your function to reorder the column names, or directly use your function to reorder the column names.
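A minimal sketch of that shape check, assuming the expected measurement names are known up front (the helper name is illustrative):
import pandas as pd

def normalise(df, expected_cols):
    # If the measurements are not in the columns, assume the sheet
    # is laid out horizontally and flip it first.
    if not set(expected_cols).issubset(df.columns):
        df = df.transpose()
    return df[sorted(expected_cols)]

# e.g. normalise(df_2, ['River_1', 'River_2', 'River_3'])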

Setting a value to a cell in a pandas dataframe

I have the following pandas dataframe:
K = pd.DataFrame({"A":[1,2,3,4], "B":[5,6,7,8]})
Then I set the cell in the first row and first column to 11:
K.iloc[0]["A"] = 11
And when I check the dataframe again, I see that the value assignment is done and K.iloc[0]["A"] is equal to 11. However when I add a column to this data frame and do the same operation for a cell in the new column, the value assignment is not successful:
K["C"] = 0
K.iloc[0]["C"] = 11
So, when I check the dataframe again, the value of K.iloc[0]["C"] is still zero. I appreciate if somebody can tell me what is going on here and how I can resolve this issue.
For simplicity, I would do the operations in a different order and use loc:
K.loc[0, 'C'] = 0
K.loc[0, ['A', 'C']] = 11
When you use K.iloc[0]["C"], you first take the first line, so you have a copy of a slice from your dataframe, then you take the column C. So you change the copy from the slice, not the original dataframe.
That your first call, K.iloc[0]["A"] = 11 worked fine was in some sens a luck.
The good habit is to use loc in "one shot", so you have access to the original value of the dataframe, not on a slice copy :
K.loc[0,"C"] = 11
Be careful that iloc and loc are different function, even if they seems quite similar here.
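A minimal sketch that reproduces the silent failure from the question, followed by the working one-shot form:
import pandas as pd

K = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8]})
K["C"] = 0
K.iloc[0]["C"] = 11   # chained: writes into a temporary copy of the row
print(K.loc[0, "C"])  # still 0
K.loc[0, "C"] = 11    # one-shot, label-based assignment
print(K.loc[0, "C"])  # 11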
If the index is the default RangeIndex, it is possible to use DataFrame.loc, but note that it sets values by the label 0 (which here happens to be the same as position 0):
K['C'] = 0
K.loc[0, ["A", "C"]] = 11
print (K)
A B C
0 11 5 11
1 2 6 0
2 3 7 0
3 4 8 0
The reason why your solution failed can be found in the docs:
This can work at times, but it is not guaranteed to, and therefore should be avoided:
dfc['A'][0] = 111
A solution with DataFrame.iloc is possible by getting the positions of the columns with Index.get_indexer:
print (K.columns.get_indexer(["A", "C"]))
[0 2]
K['C'] = 0
K.iloc[0, K.columns.get_indexer(["A", "C"])] = 11
print (K)
A B C
0 11 5 11
1 2 6 0
2 3 7 0
3 4 8 0
loc should work:
K.loc[0, 'C'] = 11
Note that the chained form K.loc[0]['C'] = 11 is again chained indexing: it may assign into a temporary copy and silently fail, so prefer the single-step loc call above.

Python Pandas: Get index of rows which column matches certain value (max) [duplicate]

How can I find the row for which the value of a specific column is maximal?
df.max() will give me the maximal value for each column, I don't know how to get the corresponding row.
Use the pandas idxmax function. It's straightforward:
>>> import pandas
>>> import numpy as np
>>> df = pandas.DataFrame(np.random.randn(5,3),columns=['A','B','C'])
>>> df
A B C
0 1.232853 -1.979459 -0.573626
1 0.140767 0.394940 1.068890
2 0.742023 1.343977 -0.579745
3 2.125299 -0.649328 -0.211692
4 -0.187253 1.908618 -1.862934
>>> df['A'].idxmax()
3
>>> df['B'].idxmax()
4
>>> df['C'].idxmax()
1
Alternatively you could also use numpy.argmax, such as numpy.argmax(df['A']) -- it provides the same thing, and appears at least as fast as idxmax in cursory observations.
idxmax() returns index labels, not integers.
Example: if you have string values as your index labels, like rows 'a' through 'e', you might want to know that the max occurs in row 4 (not row 'd').
If you want the integer position of that label within the Index, you have to get it manually (which can be tricky now that duplicate row labels are allowed).
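A sketch of that manual conversion for a unique index (with duplicate labels, Index.get_loc returns a slice or boolean mask instead of a single integer):
import pandas as pd

s = pd.Series([2, 9, 4], index=['a', 'b', 'c'])
label = s.idxmax()            # 'b'
pos = s.index.get_loc(label)  # 1 -- integer position of that label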
HISTORICAL NOTES:
idxmax() used to be called argmax() prior to 0.11
argmax was deprecated prior to 1.0.0 and removed entirely in 1.0.0
back as of Pandas 0.16, argmax used to exist and perform the same function (though appeared to run more slowly than idxmax).
argmax function returned the integer position within the index of the row location of the maximum element.
pandas moved to using row labels instead of integer indices. Positional integer indices used to be very common, more common than labels, especially in applications where duplicate row labels are common.
For example, consider this toy DataFrame with a duplicate row label:
In [19]: dfrm
Out[19]:
A B C
a 0.143693 0.653810 0.586007
b 0.623582 0.312903 0.919076
c 0.165438 0.889809 0.000967
d 0.308245 0.787776 0.571195
e 0.870068 0.935626 0.606911
f 0.037602 0.855193 0.728495
g 0.605366 0.338105 0.696460
h 0.000000 0.090814 0.963927
i 0.688343 0.188468 0.352213
i 0.879000 0.105039 0.900260
In [20]: dfrm['A'].idxmax()
Out[20]: 'i'
In [21]: dfrm.loc[dfrm['A'].idxmax()] # .ix instead of .loc in older versions of pandas
Out[21]:
A B C
i 0.688343 0.188468 0.352213
i 0.879000 0.105039 0.900260
So here a naive use of idxmax is not sufficient, whereas the old form of argmax would correctly provide the positional location of the max row (in this case, position 9).
This is exactly one of those nasty kinds of bug-prone behaviors in dynamically typed languages that makes this sort of thing so unfortunate, and worth beating a dead horse over. If you are writing systems code and your system suddenly gets used on some data sets that are not cleaned properly before being joined, it's very easy to end up with duplicate row labels, especially string labels like a CUSIP or SEDOL identifier for financial assets. You can't easily use the type system to help you out, and you may not be able to enforce uniqueness on the index without running into unexpectedly missing data.
So you're left with hoping that your unit tests covered everything (they didn't, or more likely no one wrote any tests) -- otherwise (most likely) you're just left waiting to see if you happen to smack into this error at runtime, in which case you probably have to go drop many hours worth of work from the database you were outputting results to, bang your head against the wall in IPython trying to manually reproduce the problem, finally figuring out that it's because idxmax can only report the label of the max row, and then being disappointed that no standard function automatically gets the positions of the max row for you, writing a buggy implementation yourself, editing the code, and praying you don't run into the problem again.
You might also try idxmax:
In [5]: df = pandas.DataFrame(np.random.randn(10,3),columns=['A','B','C'])
In [6]: df
Out[6]:
A B C
0 2.001289 0.482561 1.579985
1 -0.991646 -0.387835 1.320236
2 0.143826 -1.096889 1.486508
3 -0.193056 -0.499020 1.536540
4 -2.083647 -3.074591 0.175772
5 -0.186138 -1.949731 0.287432
6 -0.480790 -1.771560 -0.930234
7 0.227383 -0.278253 2.102004
8 -0.002592 1.434192 -1.624915
9 0.404911 -2.167599 -0.452900
In [7]: df.idxmax()
Out[7]:
A 0
B 8
C 7
e.g.
In [8]: df.loc[df['A'].idxmax()]
Out[8]:
A 2.001289
B 0.482561
C 1.579985
Both of the above answers would only return one index if there are multiple rows taking the maximum value. If you want all the rows, there does not seem to be a built-in function.
But it is not hard to do. Below is an example for a Series; the same can be done for a DataFrame:
In [1]: from pandas import Series, DataFrame
In [2]: s=Series([2,4,4,3],index=['a','b','c','d'])
In [3]: s.idxmax()
Out[3]: 'b'
In [4]: s[s==s.max()]
Out[4]:
b 4
c 4
dtype: int64
df.iloc[df['columnX'].argmax()]
argmax() provides the integer position corresponding to the max value of columnX. iloc can then be used to get the row of the DataFrame df at this position.
A more compact and readable solution using query() is like this:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), columns=['A', 'B', 'C'])
print(df)
# find row with maximum A
df.query('A == A.max()')
It also returns a DataFrame instead of Series, which would be handy for some use cases.
Very simple: we have df as below and we want to print a row with max value in C:
A B C
x 1 4
y 2 10
z 5 9
In:
df.loc[df['C'] == df['C'].max()] # condition check
Out:
A B C
y 2 10
If you want the entire row instead of just the id, you can use df.nlargest and pass in how many 'top' rows you want, as well as the column(s) you want them for.
df.nlargest(2,['A'])
will give you the rows corresponding to the top 2 values of A.
use df.nsmallest for min values.
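For instance, the mirror-image call:
df.nsmallest(2, ['A'])  # the rows with the two smallest values of A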
The direct ".argmax()" solution does not work for me.
The previous example provided by @ely
>>> import pandas
>>> import numpy as np
>>> df = pandas.DataFrame(np.random.randn(5,3),columns=['A','B','C'])
>>> df
A B C
0 1.232853 -1.979459 -0.573626
1 0.140767 0.394940 1.068890
2 0.742023 1.343977 -0.579745
3 2.125299 -0.649328 -0.211692
4 -0.187253 1.908618 -1.862934
>>> df['A'].argmax()
3
>>> df['B'].argmax()
4
>>> df['C'].argmax()
1
returns the following message:
FutureWarning: 'argmax' is deprecated, use 'idxmax' instead. The behavior of 'argmax'
will be corrected to return the positional maximum in the future.
Use 'series.values.argmax' to get the position of the maximum now.
So my solution is:
df['A'].values.argmax()
mx.iloc[0].idxmax()
This one line of code will give you the column label of the maximum value in the first row of the dataframe; here mx is the dataframe and iloc[0] selects the 0th row.
Considering this dataframe
[In]: df = pd.DataFrame(np.random.randn(4,3),columns=['A','B','C'])
[Out]:
A B C
0 -0.253233 0.226313 1.223688
1 0.472606 1.017674 1.520032
2 1.454875 1.066637 0.381890
3 -0.054181 0.234305 -0.557915
Assuming one want to know the rows where column "C" is max, the following will do the work
[In]: df[df['C'] == df['C'].max()]
[Out]:
A B C
1 0.472606 1.017674 1.520032
The idxmax of the DataFrame returns the label index of the row with the maximum value, and the behavior of argmax depends on the version of pandas (right now it returns a warning). If you want to use the positional index, you can do the following:
max_row = df['A'].values.argmax()
or
import numpy as np
max_row = np.argmax(df['A'].values)
Note that np.argmax(df['A']) behaves the same as df['A'].argmax().
Use:
data.iloc[data['A'].idxmax()]
data['A'].idxmax() - finds the row label of the max value
data.iloc[...] - returns the row for that label (valid here because the default RangeIndex makes labels coincide with positions)
If there are ties in the maximum values, then idxmax returns the index of only the first max value. For example, in the following DataFrame:
A B C
0 1 0 1
1 0 0 1
2 0 0 0
3 0 1 1
4 1 0 0
idxmax returns
A 0
B 3
C 0
dtype: int64
Now, if we want all indices corresponding to max values, then we could use max + eq to create a boolean DataFrame, then use it on df.index to filter out indexes:
out = df.eq(df.max()).apply(lambda x: df.index[x].tolist())
Output:
A [0, 4]
B [3]
C [0, 1, 3]
dtype: object
What worked for me is:
df[df['colX'] == df['colX'].max()]
You then get the row(s) in your df with the maximum value of colX.
Then if you just want the index, you can add .index at the end of the query.
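For instance:
df[df['colX'] == df['colX'].max()].index  # just the index label(s) of the max row(s)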
