DataFrameGroupBy.agg NamedAgg on same column errors out on custom function, but works on built-in function - python-3.x

Setup
import pandas as pd
import numpy as np

np.random.seed(0)
df = pd.DataFrame(zip([1, 1, 2, 2, 2, 3, 7, 7, 9, 10],
                      *np.random.randint(1, 100, 20).reshape(-1, 10)),
                  columns=['A', 'B', 'C'])
Out[127]:
A B C
0 1 45 71
1 1 48 89
2 2 65 89
3 2 68 13
4 2 68 59
5 3 10 66
6 7 84 40
7 7 22 88
8 9 37 47
9 10 88 89
f = lambda x: x.max()
NamedAgg on built-in function works fine
df.groupby('A').agg(B_min=('B', 'min'), B_max=('B', 'max'), C_max=('C', 'max'))
Out[133]:
B_min B_max C_max
A
1 45 48 89
2 65 68 89
3 10 10 66
7 22 84 88
9 37 37 47
10 88 88 89
NamedAgg on custom function f errors out
df.groupby('A').agg(B_min=('B', 'min'), B_max=('B', f), C_max=('C', 'max'))
KeyError: "[('B', '<lambda>')] not in index"
Is there any explanation for this error? Is it an intentional restriction?

The issue is caused by _mangle_lambda_list, which gets called at some point during the aggregation. There is a mismatch: the resulting aggregation columns get renamed (to '<lambda_0>' and so on), but the list of output columns, order, which is then used to index into the result, doesn't get changed. Since that function specifically checks whether com.get_callable_name(aggfunc) == "<lambda>", any name other than '<lambda>' will work without issue:
Sample data
import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame(zip([1, 1, 2, 2, 2, 3, 7, 7, 9, 10],
                      *np.random.randint(1, 100, 20).reshape(-1, 10)),
                  columns=['A', 'B', 'C'])
f = lambda x: x.max()
kwargs = {'B_min': ('B', 'min'), 'B_max':('B', f), 'C_max':('C', 'max')}
Here are the most relevant major steps that get called when you aggregate, and we can see where the KeyError comes from.
func, columns, order = pd.core.groupby.generic._normalize_keyword_aggregation(kwargs)
print(order)
#[('B', 'min'), ('B', '<lambda>'), ('C', 'max')]
func = pd.core.groupby.generic._maybe_mangle_lambdas(func)
df.groupby('A')._aggregate(func)
# B C
# min <lambda_0> max # _0 ruins indexing with ('B', '<lambda>')
#A
#1 45 48 89
#2 65 68 89
#3 10 10 66
#7 22 84 88
#9 37 37 47
#10 88 88 89
Because _mangle_lambda_list is only called when there are multiple aggregations for the same column, you can get away with the '<lambda>' name, so long as it is the only aggregation for that column.
df.groupby('A').agg(A_min=('A', 'min'), B_max=('B', f))
# A_min B_max
#A
#1 1 48
#2 2 68
#3 3 10
#7 7 84
#9 9 37
#10 10 88
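Given that the mangling only fires for callables literally named '<lambda>', a simple workaround (a sketch, not from the original answer) is to give the callable any other name, e.g. by using a plain def instead of a lambda:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(zip([1, 1, 2, 2, 2, 3, 7, 7, 9, 10],
                      *np.random.randint(1, 100, 20).reshape(-1, 10)),
                  columns=['A', 'B', 'C'])

# a def statement has __name__ == 'f', not '<lambda>', so it never
# enters the mangling path (setting f.__name__ on a lambda also works)
def f(x):
    return x.max()

out = df.groupby('A').agg(B_min=('B', 'min'), B_max=('B', f), C_max=('C', 'max'))
```

This produces the same result as passing the string 'max' for B_max.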

Related

pandas dataframe slicing to a subset from row #y1 to row #y2

I can't see the forest for the trees right now:
I have a Pandas dataframe:
import pandas as pd
df = pd.DataFrame({'UTCs': [32776, 32777, 32778, 32779, 32780, 32781, 32782, 32783],
                   'Temperature': [5, 7, 7, 9, 12, 9, 9, 4],
                   'Humidity': [50, 50, 48, 47, 46, 47, 48, 52],
                   'pressure': [998, 998, 999, 999, 999, 999, 1000, 1000]})
print(df)
UTCs Temperature Humidity pressure
0 32776 5 50 998
1 32777 7 50 998
2 32778 7 48 999
3 32779 9 47 999
4 32780 12 46 999
5 32781 9 47 999
6 32782 9 48 1000
7 32783 4 52 1000
Now I want to create a subset of all dataset columns for UTCs between 32778 and 32782
I can choose a subset with:
df_sub=df.iloc[2:7,:]
print(df_sub)
UTCs Temperature Humidity pressure
2 32778 7 48 999
3 32779 9 47 999
4 32780 12 46 999
5 32781 9 47 999
6 32782 9 48 1000
But how can I do that with a condition like 'choose rows between UTCs=32778 and UTCs=32782'?
Something like
df_sub = df.iloc[df[df.UTCs == 32778] : df[df.UTCs == 32783], : ]
does not work.
Any hint for me?
Use between for boolean indexing:
df_sub = df[df['UTCs'].between(32778, 32783, inclusive='left')]
output:
UTCs Temperature Humidity pressure
2 32778 7 48 999
3 32779 9 47 999
4 32780 12 46 999
5 32781 9 47 999
6 32782 9 48 1000
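For completeness, the same selection can be spelled out with explicit comparison operators, which also avoids having to remember which bound inclusive='left' keeps (that keyword form requires pandas >= 1.3). A sketch with the sample data:

```python
import pandas as pd

df = pd.DataFrame({'UTCs': [32776, 32777, 32778, 32779, 32780, 32781, 32782, 32783],
                   'Temperature': [5, 7, 7, 9, 12, 9, 9, 4],
                   'Humidity': [50, 50, 48, 47, 46, 47, 48, 52],
                   'pressure': [998, 998, 999, 999, 999, 999, 1000, 1000]})

# equivalent boolean mask; both bounds are inclusive here,
# so the upper bound is written as 32782 directly
df_sub = df[(df['UTCs'] >= 32778) & (df['UTCs'] <= 32782)]
```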

How to rearrange a pandas dataframe having N columns and append N columns together in python?

I have a dataframe df as shown below. A index, B Index and C Index appear as headers,
and each of them has a sub-header, Last Price.
Input
A index B Index C Index
Date Last Price Date Last Price Date Last Price
1/10/2021 12 1/11/2021 46 2/9/2021 67
2/10/2021 13 2/11/2021 51 3/9/2021 70
3/10/2021 14 3/11/2021 62 4/9/2021 73
4/10/2021 15 4/11/2021 47 5/9/2021 76
5/10/2021 16 5/11/2021 51 6/9/2021 79
6/10/2021 17 6/11/2021 22 7/9/2021 82
7/10/2021 18 7/11/2021 29 8/9/2021 85
I want to transform it to the below dataframe.
Expected Output
Date Index Name Last Price
1/10/2021 A index 12
2/10/2021 A index 13
3/10/2021 A index 14
4/10/2021 A index 15
5/10/2021 A index 16
6/10/2021 A index 17
7/10/2021 A index 18
1/11/2021 B Index 46
2/11/2021 B Index 51
3/11/2021 B Index 62
4/11/2021 B Index 47
5/11/2021 B Index 51
6/11/2021 B Index 22
7/11/2021 B Index 29
2/9/2021 C Index 67
3/9/2021 C Index 70
4/9/2021 C Index 73
5/9/2021 C Index 76
6/9/2021 C Index 79
7/9/2021 C Index 82
8/9/2021 C Index 85
How can this be done with a pandas dataframe?
The structure of your df is not clear from your output. It would be useful if you provided Python code that creates an example, or at the very least the output of df.columns. Now let us assume it is a 2-level MultiIndex created as such:
import pandas as pd

columns = pd.MultiIndex.from_tuples([
    ('A index', 'Date'), ('A index', 'Last Price'),
    ('B index', 'Date'), ('B index', 'Last Price'),
    ('C index', 'Date'), ('C index', 'Last Price'),
])
data = [
['1/10/2021', 12, '1/11/2021', 46, '2/9/2021', 67],
['2/10/2021', 13, '2/11/2021', 51, '3/9/2021', 70],
['3/10/2021', 14, '3/11/2021', 62, '4/9/2021', 73],
['4/10/2021', 15, '4/11/2021', 47, '5/9/2021', 76],
['5/10/2021', 16, '5/11/2021', 51, '6/9/2021', 79],
['6/10/2021', 17, '6/11/2021', 22, '7/9/2021', 82],
['7/10/2021', 18, '7/11/2021', 29, '8/9/2021', 85],
]
df = pd.DataFrame(columns = columns, data = data)
Then what you are trying to do is basically an application of .stack with some re-arrangement after:
(df.stack(level = 0)
.reset_index(level=1)
.rename(columns = {'level_1':'Index Name'})
.sort_values(['Index Name','Date'])
)
This produces
Index Name Date Last Price
0 A index 1/10/2021 12
1 A index 2/10/2021 13
2 A index 3/10/2021 14
3 A index 4/10/2021 15
4 A index 5/10/2021 16
5 A index 6/10/2021 17
6 A index 7/10/2021 18
0 B index 1/11/2021 46
1 B index 2/11/2021 51
2 B index 3/11/2021 62
3 B index 4/11/2021 47
4 B index 5/11/2021 51
5 B index 6/11/2021 22
6 B index 7/11/2021 29
0 C index 2/9/2021 67
1 C index 3/9/2021 70
2 C index 4/9/2021 73
3 C index 5/9/2021 76
4 C index 6/9/2021 79
5 C index 7/9/2021 82
6 C index 8/9/2021 85
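If the exact column order and fresh row labels from the expected output matter, a small follow-up rearrangement works; a self-contained sketch (rebuilding a two-row version of the assumed sample df):

```python
import pandas as pd

columns = pd.MultiIndex.from_tuples([
    ('A index', 'Date'), ('A index', 'Last Price'),
    ('B index', 'Date'), ('B index', 'Last Price'),
    ('C index', 'Date'), ('C index', 'Last Price'),
])
data = [['1/10/2021', 12, '1/11/2021', 46, '2/9/2021', 67],
        ['2/10/2021', 13, '2/11/2021', 51, '3/9/2021', 70]]
df = pd.DataFrame(columns=columns, data=data)

out = (df.stack(level=0)
         .reset_index(level=1)
         .rename(columns={'level_1': 'Index Name'})
         .sort_values(['Index Name', 'Date'])
         .reset_index(drop=True)                   # fresh 0..n-1 row labels
         [['Date', 'Index Name', 'Last Price']])   # column order from the question
```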

define range in pandas column based on define input from list

I have one data frame in which I need to apply ranges to one column, based on a provided list.
I am able to achieve the result using fixed values, but the input will come in as a dynamic list and the ranges will be based on it.
My data frame looks like below:
import pandas as pd
rangelist=[90,70,50]
data = {'Result': [75,85,95,45,76,8,10,44,22,65,35,67]}
sampledf=pd.DataFrame(data)
rangelist is my list; from it I need to create ranges like 100-90, 90-70 & 70-50. These ranges may differ from time to time. Until now I have been achieving the result using the below function.
def cat(value):
    cat = ''
    if (value > 90):
        cat = '90-100'
    if (value < 90 and value > 70):
        cat = '90-70'
    else:
        cat = '< 50'
    return cat
sampledf['category']=sampledf['Result'].apply(cat)
How can I pass dynamic value in function"cat" based on the range list? I will be grateful if someone can help me to achieve the below result.
Result category
0 75 90-70
1 85 90-70
2 95 < 50
3 45 < 50
4 76 90-70
5 8 < 50
6 10 < 50
7 44 < 50
8 22 < 50
9 65 < 50
10 35 < 50
11 67 < 50
I would recommend pd.cut for this:
import numpy as np

sampledf['Category'] = pd.cut(sampledf['Result'],
                              [-np.inf] + sorted(rangelist) + [np.inf])
Output:
Result Category
0 75 (70.0, 90.0]
1 85 (70.0, 90.0]
2 95 (90.0, inf]
3 45 (-inf, 50.0]
4 76 (70.0, 90.0]
5 8 (-inf, 50.0]
6 10 (-inf, 50.0]
7 44 (-inf, 50.0]
8 22 (-inf, 50.0]
9 65 (50.0, 70.0]
10 35 (-inf, 50.0]
11 67 (50.0, 70.0]
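To get string categories like those in the question instead of Interval objects, pd.cut also accepts a labels argument. A sketch that builds one label per bin from the sorted breakpoints (the label text here is illustrative, not from the original answer):

```python
import numpy as np
import pandas as pd

rangelist = [90, 70, 50]
sampledf = pd.DataFrame({'Result': [75, 85, 95, 45, 76, 8, 10, 44, 22, 65, 35, 67]})

edges = [-np.inf] + sorted(rangelist) + [np.inf]   # [-inf, 50, 70, 90, inf]
# one string label per interval: '< 50', '50-70', '70-90', '> 90'
labels = (['< %d' % edges[1]]
          + ['%d-%d' % (lo, hi) for lo, hi in zip(edges[1:-2], edges[2:-1])]
          + ['> %d' % edges[-2]])
sampledf['category'] = pd.cut(sampledf['Result'], edges, labels=labels)
```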
import numpy as np
breaks = pd.Series([100, 90, 75, 50, 45, 20, 0])
sampledf["ind"] = sampledf.Result.apply(lambda x: np.where(x >= breaks)[0][0])
sampledf["category"] = sampledf.ind.apply(lambda i: (breaks[i], breaks[i-1]))
sampledf
# Result ind category
# 0 75 2 (75, 90)
# 1 85 2 (75, 90)
# 2 95 1 (90, 100)
# 3 45 4 (45, 50)
# 4 76 2 (75, 90)
# 5 8 6 (0, 20)
# 6 10 6 (0, 20)
# 7 44 5 (20, 45)
# 8 22 5 (20, 45)
# 9 65 3 (50, 75)
# 10 35 5 (20, 45)
# 11 67 3 (50, 75)
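The per-row apply in this answer can also be vectorized with np.searchsorted; a sketch using the same breakpoints reordered ascending (side='right' keeps the lower bound inclusive, matching the x >= breaks test above):

```python
import numpy as np
import pandas as pd

sampledf = pd.DataFrame({'Result': [75, 85, 95, 45, 76, 8, 10, 44, 22, 65, 35, 67]})
edges = np.array([0, 20, 45, 50, 75, 90, 100])  # same breakpoints, ascending

# one searchsorted call over the whole column replaces the two applies;
# for each value it returns the index of the first edge strictly above it
idx = np.searchsorted(edges, sampledf['Result'].to_numpy(), side='right')
sampledf['category'] = [(edges[i - 1], edges[i]) for i in idx]
```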

Appending DataFrame to empty DataFrame in {Key: Empty DataFrame (with columns)}

I am struggling to understand this one.
I have a regular df (same columns as the empty df in dict) and an empty df which is a value in a dictionary (the keys in the dict are variable based on certain inputs, so can be just one key/value pair or multiple key/value pairs - think this might be relevant). The dict structure is essentially:
{key: [[Empty DataFrame
Columns: [list of columns]
Index: []]]}
I am using the following code to try and add the data:
dict[key].append(df, ignore_index=True)
The error I get is:
temp_dict[product_match].append(regular_df, ignore_index=True)
TypeError: append() takes no keyword arguments
Is this error due to me mis-specifying the value I am attempting to append the df to (like am I trying to append the df to the key instead here) or something else?
Your dictionary contains a list of lists at the key, we can see this in the shown output:
{key: [[Empty DataFrame Columns: [list of columns] Index: []]]}
# ^^ list starts ^^ list ends
For this reason dict[key].append is calling list.append, as mentioned by @nandoquintana.
To append to the DataFrame access the specific element in the list:
temp_dict[product_match][0][0].append(df, ignore_index=True)
Notice there is no inplace version of append. append always produces a new DataFrame:
Sample Program:
import numpy as np
import pandas as pd
temp_dict = {
    'key': [[pd.DataFrame()]]
}
product_match = 'key'
np.random.seed(5)
df = pd.DataFrame(np.random.randint(0, 100, (5, 4)))
temp_dict[product_match][0][0].append(df, ignore_index=True)
print(temp_dict)
Output (temp_dict was not updated):
{'key': [[Empty DataFrame
Columns: []
Index: []]]}
The new DataFrame will need to be assigned to the correct location.
Either a new variable:
some_new_variable = temp_dict[product_match][0][0].append(df, ignore_index=True)
some_new_variable
0 1 2 3
0 99 78 61 16
1 73 8 62 27
2 30 80 7 76
3 15 53 80 27
4 44 77 75 65
Or back to the list:
temp_dict[product_match][0][0] = (
temp_dict[product_match][0][0].append(df, ignore_index=True)
)
temp_dict
{'key': [[ 0 1 2 3
0 99 78 61 16
1 73 8 62 27
2 30 80 7 76
3 15 53 80 27
4 44 77 75 65]]}
Assuming the DataFrame is actually an empty DataFrame, append is unnecessary, as simply updating the value at the key to be that DataFrame works:
temp_dict[product_match] = df
temp_dict
{'key': 0 1 2 3
0 99 78 61 16
1 73 8 62 27
2 30 80 7 76
3 15 53 80 27
4 44 77 75 65}
Or if list of list is needed:
temp_dict[product_match] = [[df]]
temp_dict
{'key': [[ 0 1 2 3
0 99 78 61 16
1 73 8 62 27
2 30 80 7 76
3 15 53 80 27
4 44 77 75 65]]}
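A side note beyond the original answer: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current pandas the same assign-back pattern uses pd.concat. A sketch with simplified sample data:

```python
import pandas as pd

temp_dict = {'key': [[pd.DataFrame()]]}
product_match = 'key'
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# pd.concat, like append, returns a new DataFrame, so the result
# still has to be assigned back into the nested list
temp_dict[product_match][0][0] = pd.concat(
    [temp_dict[product_match][0][0], df], ignore_index=True
)
```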
Maybe you have an empty list at dict[key]?
Remember that the list "append" method (unlike the Pandas DataFrame one) only receives one parameter:
https://docs.python.org/3/tutorial/datastructures.html#more-on-lists
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html

How to sort a pandas dataframe by the standard deviations of its columns?

Given the following example :
from statistics import stdev
d = pd.DataFrame({"a": [45, 55], "b": [5, 95], "c": [30, 70]})
stds = [stdev(d[c]) for c in d.columns]
With output:
In [87]: d
Out[87]:
a b c
0 45 5 30
1 55 95 70
In [91]: stds
Out[91]: [7.0710678118654755, 63.63961030678928, 28.284271247461902]
I would like to be able to sort the columns of the dataframe by their
standard deviations, resulting in the following
b c a
0 5 30 45
1 95 70 55
You are looking for:
d.iloc[:,(-d.std()).argsort()]
Out[8]:
b c a
0 5 30 45
1 95 70 55
You can get the column order like this:
column_order = d.std().sort_values(ascending=False).index
# >>> column_order
# Index(['b', 'c', 'a'], dtype='object')
And then sort the columns like this:
d[column_order]
b c a
0 5 30 45
1 95 70 55
