How to get an item's group mean but exclude the item itself? - featuretools

How can I use featuretools to get the mean value of the group that an item belongs to, but excluding the item itself?
For example,
Input:
item group value1
I1 C1 1
I2 C2 5
I3 C2 3
I4 C2 8
I5 C1 4
I6 C1 5
I7 C1 6
I8 C2 4
I9 C3 2
I10 C3 3
Expected output:
item mean_value1_peergroup
I1 5 #mean([4,5,6]) rather than mean([1, 4, 5, 6])
I2 5 #mean([3, 8, 4])
...
I10 2 #mean([2])

This can be done with a custom transform primitive. You'd define the primitive like this
import pandas as pd
import featuretools as ft
from featuretools.primitives import TransformPrimitive
from featuretools.variable_types import Numeric

class MeanExcludingValue(TransformPrimitive):
    name = "mean_excluding_value"
    input_types = [Numeric]
    return_type = Numeric
    stack_on_self = False

    def get_function(self):
        def mean_excluding_value(s):
            """Calculate the mean of the group excluding the current element."""
            # Note: this divides by the full group size; use (len(s) - 1)
            # if you want the strict mean of only the *other* group members.
            return (s.sum() - s) / len(s)
        return mean_excluding_value
Now, let's create a sample of data like your example and load it into an entity set.
df = pd.DataFrame({
    "item": ["I1", "I2", "I3", "I4", "I5", "I6", "I7", "I8", "I9", "I10"],
    "group": ["C1", "C1", "C1", "C1", "C1", "C1", "C1", "C2", "C2", "C3"],
    "value1": [1, 5, 3, 8, 4, 5, 6, 4, 2, 3]
})

es = ft.EntitySet()
es.entity_from_dataframe(entity_id="example",
                         dataframe=df,
                         index="item",
                         variable_types={
                             "group": ft.variable_types.Id  # this is important for grouping later
                         })
Finally, we call dfs with the new primitive.
fm, fl = ft.dfs(target_entity="example",
                entityset=es,
                trans_primitives=[MeanExcludingValue],
                groupby_trans_primitives=[MeanExcludingValue],
                max_depth=1)
fm
this returns
      value1 group  MEAN_EXCLUDING_VALUE(value1)  MEAN_EXCLUDING_VALUE(value1) by group
item
I1         1    C1                           4.0                               4.428571
I10        3    C3                           3.8                               0.000000
I2         5    C1                           3.6                               3.857143
I3         3    C1                           3.8                               4.142857
I4         8    C1                           3.3                               3.428571
I5         4    C1                           3.7                               4.000000
I6         5    C1                           3.6                               3.857143
I7         6    C1                           3.5                               3.714286
I8         4    C2                           3.7                               1.000000
I9         2    C2                           3.9                               2.000000
You can read more about the difference between trans_primitives and groupby_trans_primitives in the featuretools documentation.
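If you only need this one feature and not a full featuretools pipeline, the same idea can be sketched in plain pandas with groupby().transform. This sketch is not part of the original answer: it uses the question's original grouping and the strict leave-one-out denominator len(s) - 1 (the primitive above divides by the full group size), so it reproduces the expected output from the question.
import pandas as pd

# Data exactly as given in the question (not the modified sample above)
df_q = pd.DataFrame({
    "item": ["I1", "I2", "I3", "I4", "I5", "I6", "I7", "I8", "I9", "I10"],
    "group": ["C1", "C2", "C2", "C2", "C1", "C1", "C1", "C2", "C3", "C3"],
    "value1": [1, 5, 3, 8, 4, 5, 6, 4, 2, 3],
})

# Leave-one-out mean: drop the row's own value from the group sum and divide
# by the remaining group size. A singleton group would give 0/0 -> NaN.
df_q["mean_value1_peergroup"] = df_q.groupby("group")["value1"].transform(
    lambda s: (s.sum() - s) / (len(s) - 1)
)
print(df_q)  # I1 -> 5.0, I2 -> 5.0, ..., I10 -> 2.0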

Related

Pandas Dataframe array entries as rows [duplicate]

I have a pandas DataFrame with arrays as entries, which I would like to disaggregate into a long format.
Below is the code to reproduce what I am looking for. Stack Overflow asked me to add more detail to this draft because it is mostly code, but it is mostly code because that lets the reader reproduce the issue more easily.
import pandas as pd
import numpy as np

date = '08-30-2022'
ids = ['s1', 's2']
g1 = ['b1', 'b2']
g2 = ['b1', 'b3', 'b4']
g_ls = [g1, g2]
v1 = [2.0, 2.5]
v2 = [3.2, np.nan, 3.7]
v_ls = [v1, v2]

dict_in = {
    'date': [date] * len(ids),
    'ids': ids,
    'group': g_ls,
    'values': v_ls
}
df_in = pd.DataFrame.from_dict(dict_in)

dict_out = {
    'date': [date] * 5,
    'ids': ['s1', 's1', 's2', 's2', 's2'],
    'group': ['b1', 'b2', 'b1', 'b3', 'b4'],
    'values': [2.0, 2.5, 3.2, np.nan, 3.7]
}
desired_df = pd.DataFrame.from_dict(dict_out)
Have:
date ids group values
0 08-30-2022 s1 [b1, b2] [2.0, 2.5]
1 08-30-2022 s2 [b1, b3, b4] [3.2, nan, 3.7]
Want:
date ids group values
0 08-30-2022 s1 b1 2.0
1 08-30-2022 s1 b2 2.5
2 08-30-2022 s2 b1 3.2
3 08-30-2022 s2 b3 NaN
4 08-30-2022 s2 b4 3.7
Try with
df = df_in.explode(['group','values'])
Out[173]:
date ids group values
0 08-30-2022 s1 b1 2.0
0 08-30-2022 s1 b2 2.5
1 08-30-2022 s2 b1 3.2
1 08-30-2022 s2 b3 NaN
1 08-30-2022 s2 b4 3.7
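A small follow-up sketch (not from the answer): explode() keeps the original index (0, 0, 1, 1, 1) and leaves the exploded column object-typed, so two extra steps give an exact match with desired_df. Note that exploding several columns at once requires pandas 1.3 or newer.
# Reset the duplicated index and restore a float dtype for 'values'
df_out = df_in.explode(['group', 'values']).reset_index(drop=True)
df_out['values'] = df_out['values'].astype(float)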

Apply np.where or np.select to multiple column pairs

Given a DataFrame df as follows:
import pandas as pd
import numpy as np

data = [[1, 'A1', 'A1'], [2, 'A2', 'B2', 1, 1], [3, 'B3', 'B3', 3, 2], [4, None, None]]
df = pd.DataFrame(data, columns=['id', 'v1', 'v2', 'v3', 'v4'])
print(df)
Out:
id v1 v2 v3 v4
0 1 A1 A1 NaN NaN
1 2 A2 B2 1.0 1.0
2 3 B3 B3 3.0 2.0
3 4 None None NaN NaN
Let's say I need to check if multiple column pairs have identical content or same values:
col_pair = {'v1': 'v2', 'v3': 'v4'}
If I don't want to repeat np.where for every pair as below, how can I make use of col_pair (or any other approach) to achieve the same result? Thanks.
df['v1_v2'] = np.where(df['v1'] == df['v2'], 1, 0)
df['v3_v4'] = np.where(df['v3'] == df['v4'], 1, 0)
The expected result:
id v1 v2 v3 v4 v1_v2 v3_v4
0 1 A1 A1 NaN NaN 1 NaN
1 2 A2 B2 1.0 1.0 0 1
2 3 B3 B3 3.0 2.0 1 0
3 4 None None NaN NaN NaN NaN
You also need to test whether both values of each key/value pair are missing, using DataFrame.isna with DataFrame.all, and pass the conditions to numpy.select:
for k, v in col_pair.items():
    df[f'{k}_{v}'] = np.select([df[[k, v]].isna().all(axis=1),
                                df[k] == df[v]], [None, 1], default=0)
Out:
id v1 v2 v3 v4 v1_v2 v3_v4
0 1 A1 A1 NaN NaN 1 None
1 2 A2 B2 1.0 1.0 0 1
2 3 B3 B3 3.0 2.0 1 0
3 4 None None NaN NaN None None
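The same logic can also be written without mutating df inside the loop, collecting every pair's flag first and joining once. This is just a sketch of an equivalent formulation; df and col_pair are the objects defined in the question.
flags = {
    f'{k}_{v}': np.select([df[[k, v]].isna().all(axis=1), df[k] == df[v]],
                          [None, 1], default=0)
    for k, v in col_pair.items()
}
df = df.join(pd.DataFrame(flags, index=df.index))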

How to reorganize/restructure values in a dataframe with no column header by referring to a master dataframe in python?

Master Dataframe:

B    D    E
b1   d1   e1
b2   d2   e2
b3   d3
     d4
     d5
Dataframe with no column name:

b1
d3   e1
d2
b2   e2
e1
d5   e1
How do I convert the dataframe above into something like the table below (with column names) by referring to the master dataframe?
B    D    E
b1
     d3   e1
     d2
b2        e2
          e1
     d5   e1
Thank you in advance for your help!
One way would be to make a mapping dict, then reindex each row:
import numpy as np
import pandas as pd

# Mapping dict
d = {}
for k, v in df.to_dict("list").items():
    d.update(**dict.fromkeys(set(v) - {np.nan}, k))

# or pandas approach
d = df.melt().dropna().set_index("value")["variable"].to_dict()

def reorganize(ser):
    data = [i for i in ser if pd.notna(i)]
    ind = [d.get(i, i) for i in data]
    return pd.Series(data, index=ind)

df2.apply(reorganize, axis=1)
Output:
B D E
0 b1 NaN NaN
1 NaN d3 e1
2 NaN d2 NaN
3 b2 NaN e2
4 NaN NaN e1
5 NaN d5 e1
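If you want the columns in the master frame's order even when a letter never occurs, you can reindex the result. This is a small optional step, reusing the reorganize function defined above (df is the master DataFrame, df2 the unnamed one):
# Force the output to carry exactly the master's columns, in its order;
# any column that never occurs in df2 is added as all-NaN.
result = df2.apply(reorganize, axis=1).reindex(columns=df.columns)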
It's not a beautiful answer, but I think I was able to do it by using .loc. I don't think you need the master DataFrame at all.
import pandas as pd

df = pd.DataFrame({'col1': ['b1', 'd3', 'd2', 'b2', 'e1', 'd5'],
                   'col2': ['', 'e1', '', 'e2', '', 'e1']},
                  columns=['col1', 'col2'])
df
#   col1 col2
# 0   b1
# 1   d3   e1
# 2   d2
# 3   b2   e2
# 4   e1
# 5   d5   e1

df_reshaped = pd.DataFrame()
for index, row in df.iterrows():
    for col in df.columns:
        i = row[col]
        j = i[0] if i != '' else ''  # first letter decides the target column
        if j != '':
            df_reshaped.loc[index, j] = i

df_reshaped.columns = df_reshaped.columns.str.upper()
df_reshaped
# B D E
# 0 b1 NaN NaN
# 1 NaN d3 e1
# 2 NaN d2 NaN
# 3 b2 NaN e2
# 4 NaN NaN e1
# 5 NaN d5 e1

How to subset a DataFrame based on similar column names

How can I subset similar columns in pandas based on keywords like A, B, C and D? I have taken these as an example; is there a better approach whose logic still works when new columns are given?
df
A1 A2 A3 B1 B2 B3 C1 C2 D1 D2 D3 D4
1 a x 1 a x 3 c 7 d s 4
2 b 5 2 b 5 4 d s c 7 d
3 c 7 3 c 7 1 a x 1 a x
4 d s 4 d s b 5 2 b s 7
You can use pandas.Index.groupby
groups = df.columns.groupby(df.columns.str[0])
#{'A': ['A1', 'A2', 'A3'],
# 'B': ['B1', 'B2', 'B3'],
# 'C': ['C1', 'C2'],
# 'D': ['D1', 'D2', 'D3', 'D4']}
Then you can access data this way:
df[groups['B']]
# B1 B2 B3
#0 1 a x
#1 2 b 5
#2 3 c 7
#3 4 d s
Keep in mind groups is a dict, so you can use any dict method too.
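For example, because groups is an ordinary dict, you can iterate it once to build a sub-DataFrame per letter (just a small usage sketch):
sub_frames = {letter: df[list(cols)] for letter, cols in groups.items()}
sub_frames['B']  # same as df[groups['B']] above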
Another approach can be to use df.columns in conjunction with str.contains
a_col_lst = df.columns[df.columns.str.contains('A')]
b_col_lst = df.columns[df.columns.str.contains('B')]
df_A = df.loc[:, a_col_lst]
df_B = df.loc[:, b_col_lst]
You can apply regex as well within columns.str.contains
You could use filter along with a regex pattern, e.g.
df_A = df.filter(regex=(r'^A.*'))
You could also select the columns with startswith (DataFrame.select has been removed from modern pandas, so use loc instead):
df_A = df.loc[:, df.columns.str.startswith('A')]

How to subset a DataFrame by only a column having multiple entries?

I have a pandas DataFrame df that looks like this:
0 1
C1 V1
C2 V1
C3 V1
C4 V2
C5 V3
C6 V3
C7 V4
I wish to keep only those rows whose value in column 1 occurs more than once, the desired output being:
0 1
C1 V1
C2 V1
C3 V1
C5 V3
C6 V3
How do I do this?
I think you need boolean indexing with a mask created by DataFrame.duplicated with keep=False, which marks all duplicates as True:
If the column labels are strings:
print (df.columns)
Index(['0', '1'], dtype='object')
mask = df.duplicated('1', keep=False)
#another solution with Series.duplicated
#mask = df['1'].duplicated(keep=False)
print (mask)
0 True
1 True
2 True
3 False
4 True
5 True
6 False
dtype: bool
print (df[mask])
0 1
0 C1 V1
1 C2 V1
2 C3 V1
4 C5 V3
5 C6 V3
If the column labels are integers, the same approach works with integer keys:
print (df.columns)
Int64Index([0, 1], dtype='int64')
mask = df.duplicated(1, keep=False)
#another solution with Series.duplicated
#mask = df[1].duplicated(keep=False)
print (mask)
0 True
1 True
2 True
3 False
4 True
5 True
6 False
dtype: bool
print (df[mask])
0 1
0 C1 V1
1 C2 V1
2 C3 V1
4 C5 V3
5 C6 V3
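An alternative sketch (not part of the answer above): keep the rows whose value in the second column occurs more than once by comparing each group's size, which gives the same mask as duplicated(keep=False) and works for either string or integer column labels.
col = df.columns[1]
mask = df.groupby(col)[col].transform('size') > 1
print(df[mask])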
