How to normalize the entity having multiple values for the one feature in featuretools? - featuretools

Below is an example:
buy_log_df = pd.DataFrame(
[
["2020-01-02", 0, 1, 2, 2],
["2020-01-02", 1, 1, 1, 3],
["2020-01-02", 2, 2, 1, 1],
["2020-01-02", 3, 3, 3, 1],
],
columns=['date', 'sale_id', 'customer_id', "item_id", "quantity"]
)
item_df = pd.DataFrame(
[
[1, 100],
[2, 200],
[3, 300],
],
columns=['item_id', 'price']
)
item_df2 = pd.DataFrame(
[
[1, '1 3 10'],
[2, '1 3'],
[3, '2 5'],
],
columns=['item_id', 'tags']
)
As you can see here, each item in item_df has multiple tag values as an one feature.
Here is what I've tried:
item_df2 = pd.concat([item_df2, item_df2['tags'].str.split(expand=True)], axis=1)
item_df2 = pd.melt(
item_df2,
id_vars=['item_id'],
value_vars=[0,1,2],
value_name="tags"
)
tag_log_df = item_df2[item_df2['tags'].notna()].drop("variable", axis=1,).sort_values("item_id")
tag_log_df
>>>
item_id tags
0 1 1
3 1 3
6 1 10
1 2 1
4 2 3
2 3 2
5 3 5
It looks like I can't normalize this item entity (from buy_log entity) because it has multiple duplicated item_ids in the table.
How can I handle this case when I design the entityset?

Thanks for the question. To handle multiple tag values, you can normalize the tags into a data frame before structuring the entity set.
buy_log_df
date sale_id customer_id item_id quantity
2020-01-02 0 1 2 2
2020-01-02 1 1 1 3
2020-01-02 2 2 1 1
2020-01-02 3 3 3 1
item_df
item_id price
1 100
2 200
3 300
tag_log_df
item_id tags
1 1
1 3
1 10
2 1
2 3
3 2
3 5
With the normalized data, you can then structure the entity set.
es = ft.EntitySet()
es.entity_from_dataframe(
entity_id='buy_log',
dataframe=buy_log_df,
index='sale_id',
time_index='date',
)
es.entity_from_dataframe(
entity_id='item',
dataframe=item_df,
index='item_id',
)
es.entity_from_dataframe(
entity_id='tag_log',
dataframe=tag_log_df,
index='tag_log_id',
make_index=True,
)
parent = es['item']['item_id']
child = es['buy_log']['item_id']
es.add_relationship(ft.Relationship(parent, child))
child = es['tag_log']['item_id']
es.add_relationship(ft.Relationship(parent, child))

Related

Print a groupby object for a specific group/groups only

I need to print the result of groupby object in Python for a specific group/groups only.
Below is the dataframe:
import pandas as pd
df = pd.DataFrame({'ID' : [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4],
'Entry' : [1, 2, 3, 4, 1, 2, 3, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6]})
print("\n df = \n",df)
In order to group the dataferame by ID and print the result I used these codes:
grouped_by_unit = df.groupby(by="ID")
print("\n", grouped_by_unit.apply(print))
Can somebody please let me know below two things:
How can I print the data frame grouped by 'ID=1' only?
I need to get the below output:
Likewise, how can I print the data frame grouped by 'ID=1' and 'ID=4' together?
I need to get the below output:
You can iterate over the groups for example with for-loop:
grouped_by_unit = df.groupby(by="ID")
for id_, g in grouped_by_unit:
if id_ == 1 or id_ == 4:
print(g)
print()
Prints:
ID Entry
0 1 1
1 1 2
2 1 3
3 1 4
ID Entry
12 4 1
13 4 2
14 4 3
15 4 4
16 4 5
17 4 6
You can use get_group function:
df.groupby(by="ID").get_group(1)
which prints
ID Entry
0 1 1
1 1 2
2 1 3
3 1 4
You can use the same method to print the group for the key 4.

Set 3 level of column names in pandas DataFrame

I'm trying to have a frame with the following structure
h/a totales
sub1 sub2 sub1 sub2
a b ... f g ....m a b ... f g ....m
That being, 2 labels for the first layer, again 2 labels for the second one, and then a subset of column names where sub1 and sub2 doesn't have the same column names.
In order to do so I did the following:
columnas=pd.MultiIndex.from_product([['h/a','totals'],['means','percentages'],
[('means','a'),('means','b'),....('percentage','g'),....],
names=['data level 1','data level 2','data level 3']])
data=[data,pata,......]
newframe=pd.DataFrame(data,columns=columnas)
What I get is this error:
>ValueError: Shape of passed values is (1, 21), indices imply (84, 21)
How can I fix this to have a multi leveled frame by column names?
Thank you
I think need MultiIndex.from_tuples from list comprehensions:
L1 = list('abc')
L2 = list('ghi')
tups = ([('h/a','means', x) for x in L1] +
[('h/a','percentage', x) for x in L2] +
[('totals','means', x) for x in L1] +
[('totals','percentage', x) for x in L2])
columnas=pd.MultiIndex.from_tuples(tups, names=['data level 1','data level 2','data level 3'])
print (columnas)
MultiIndex(levels=[['h/a', 'totals'],
['means', 'percentage'],
['a', 'b', 'c', 'g', 'h', 'i']],
labels=[[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
[0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1],
[0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5]],
names=['data level 1', 'data level 2', 'data level 3'])
#some random data
np.random.seed(785)
data = np.random.randint(10, size=(3, 12))
print (data)
[[8 0 4 1 2 5 4 1 4 1 1 8]
[1 5 0 7 4 8 4 1 3 8 0 2]
[5 9 4 9 4 6 3 7 0 5 2 1]]
newframe=pd.DataFrame(data,columns=columnas)
print (newframe)
data level 1 h/a totals
data level 2 means percentage means percentage
data level 3 a b c g h i a b c g h i
0 8 0 4 1 2 5 4 1 4 1 1 8
1 1 5 0 7 4 8 4 1 3 8 0 2
2 5 9 4 9 4 6 3 7 0 5 2 1

Pandas: How to build a column based on another column which is indexed by another one?

I have this dataframe presented below. I tried a solution below, but I am not sure if this is a good solution.
import pandas as pd
def creatingDataFrame():
raw_data = {'code': [1, 2, 3, 2 , 3, 3],
'Region': ['A', 'A', 'C', 'B' , 'A', 'B'],
'var-A': [2,4,6,4,6,6],
'var-B': [20, 30, 40 , 50, 10, 20],
'var-C': [3, 4 , 5, 1, 2, 3]}
df = pd.DataFrame(raw_data, columns = ['code', 'Region','var-A', 'var-B', 'var-C'])
return df
if __name__=="__main__":
df=creatingDataFrame()
df['var']=np.where(df['Region']=='A',1.0,0.0)*df['var-A']+np.where(df['Region']=='B',1.0,0.0)*df['var-B']+np.where(df['Region']=='C',1.0,0.0)*df['var-C']
I want the variable var assumes values of column 'var-A', 'var-B' or 'var-C' depending on the region provided by region 'Region'.
The result must be
df['var']
Out[50]:
0 2.0
1 4.0
2 5.0
3 50.0
4 6.0
5 20.0
Name: var, dtype: float64
You can try with lookup
df.columns=df.columns.str.split('-').str[-1]
df
Out[255]:
code Region A B C
0 1 A 2 20 3
1 2 A 4 30 4
2 3 C 6 40 5
3 2 B 4 50 1
4 3 A 6 10 2
5 3 B 6 20 3
df.lookup(df.index,df.Region)
Out[256]: array([ 2, 4, 5, 50, 6, 20], dtype=int64)
#df['var']=df.lookup(df.index,df.Region)

vectorize groupby pandas

I have a dataframe like this:
day time category count
1 1 a 13
1 2 a 47
1 3 a 1
1 5 a 2
1 6 a 4
2 7 a 14
2 2 a 10
2 1 a 9
2 4 a 2
2 6 a 1
I want to group by day, and category and get a vector of the counts per time. Where time can be between 1 and 10. The max and min of time I have defined in two variables called max and min.
This is how I want the resulting dataframe to look:
day category count
1 a [13,47,1,0,2,4,0,0,0,0]
2 a [9,10,0,2,0,1,14,0,0,0]
Does anyone know how to make this aggregation into a vaector?
Use reindex with MultiIndex.from_product for append missing categories and then groupby with list:
df = df.set_index(['day','time', 'category'])
a = df.index.levels[0]
b = range(1,11)
c = df.index.levels[2]
df = df.reindex(pd.MultiIndex.from_product([a,b,c], names=df.index.names), fill_value=0)
df = df.groupby(['day','category'])['count'].apply(list).reset_index()
print (df)
day category count
0 1 a [13, 47, 1, 0, 2, 4, 0, 0, 0, 0]
1 2 a [9, 10, 0, 2, 0, 1, 14, 0, 0, 0]
EDIT:
df = (df.set_index(['day','time', 'category'])['count']
.unstack(1, fill_value=0)
.reindex(columns=range(1,11), fill_value=0))
print (df)
time 1 2 3 4 5 6 7 8 9 10
day category
1 a 13 47 1 0 2 4 0 0 0 0
2 a 9 10 0 2 0 1 14 0 0 0
df = df.apply(list, 1).reset_index(name='count')
print (df)
day ... count
0 1 ... [13, 47, 1, 0, 2, 4, 0, 0, 0, 0]
1 2 ... [9, 10, 0, 2, 0, 1, 14, 0, 0, 0]
[2 rows x 3 columns]

Remove all data in a DF by group based on a condition (pandas,python3)

I have a pandas DF like this:
User Enrolled Time
1 0 12
1 0 1
1 1 2
1 1 3
2 1 3
2 0 4
2 1 1
3 0 2
3 0 3
3 1 4
4 0 1
I want to remove all rows of a users information after they have enrolled. Each users chance to enroll is timed in order. Expected output to look like this:
User Enrolled Time
1 0 12
1 0 1
1 1 2
2 1 3
3 0 2
3 0 3
3 1 4
Hoping someone could help me!
EDIT: Example based on comment for correct answer:
User Enrolled Time
4 0 1
4 0 2
4 0 3
5 0 1
I think what you're looking for is a groupby followed by an apply which does the correct logic for each user. For example:
df = pd.DataFrame([[ 1, 0, 12],
[ 1, 0, 1],
[ 1, 1, 2],
[ 1, 1, 3],
[ 2, 1, 3],
[ 2, 0, 4],
[ 2, 1, 1],
[ 3, 0, 2],
[ 3, 0, 3],
[ 3, 1, 4]],
columns=['User', 'Enrolled', 'Time'])
def filter_enrollment(df):
enrolled = df[df.Enrolled == 1].index.min()
return df[df.index <= enrolled]
result = df.groupby('User').apply(filter_enrollment).reset_index(drop=True)
The result is:
>>> print(result)
User Enrolled Time
0 1 0 12
1 1 0 1
2 1 1 2
3 2 1 3
4 3 0 2
5 3 0 3
6 3 1 4
Here I'm assuming your rows are in order of time. If you want to expliticly filter by the time column instead just change index to Time in the filter function.
Edit: to get the answer of the edited question, you can change the filter function to something like this:
def filter_enrollment(df):
enrolled = df[df.Enrolled == 1].index.min()
if pd.isnull(enrolled):
return df
else:
return df[df.index <= enrolled]

Resources