featuretools: accumulate the unique-value count grouped by user with timestamp

I have a dataset like this:
user_id event_name event_timestamp origin
0 1001790 deals 2020-01-01 12:07:05.089002
1 1001818 purchase 2019-10-30 09:15:38.810000 ICN
2 1001969 deals 2019-12-16 01:11:06.595004
3 1001969 deals 2019-12-16 01:11:22.811008
4 1001969 purchase 2019-12-21 12:20:24.405000 PUS
5 1001969 view_item 2019-12-21 12:22:01.318000 ICN
import featuretools as ft

# "total" is the raw event DataFrame shown above
es = ft.EntitySet(id="dataset")
variable_types = {
    'event_timestamp': ft.variable_types.Datetime,
    'user_id': ft.variable_types.Id,
    'origin': ft.variable_types.Categorical,
    'event_name': ft.variable_types.Categorical,
}
es.entity_from_dataframe(
    entity_id='total',
    dataframe=total,
    index='event_timestamp',
    variable_types=variable_types,
)
es.normalize_entity(
    base_entity_id='total',
    new_entity_id='users',
    index='user_id',
    copy_variables=['event_timestamp'],
    make_time_index=False,
)
es.normalize_entity(
    base_entity_id='total',
    new_entity_id='origin',
    index='origin',
    make_time_index=False,
)
es.normalize_entity(
    base_entity_id='total',
    new_entity_id='event_name',
    index='event_name',
    make_time_index=False,
)
And I want the result to look like this:
NUM_UNIQUE(total.event_name) NUM_UNIQUE(total.origin)
user_id time
1001818 2019-10-30 09:15:38.810000 1 1
1001969 2019-12-21 12:11:06.595004 1 0
2019-12-21 12:11:22.811008 1 0
2019-12-21 12:20:24.405000 1 1
2019-12-21 12:22:01.318000 2 2
1001790 2020-01-01 12:07:05.089002 1 1
Thus, if I set the window to 5 minutes, then for user_id 1001969 the cumulative count should not carry over between the second and third rows, since they are more than 5 minutes apart.

For rolling windows, you can apply a training window to each cutoff time. Here are the cutoff times:
user_id time
1001969 2019-12-21 12:11:06.595004
1001969 2019-12-21 12:11:22.811008
1001969 2019-12-21 12:20:24.405000
1001969 2019-12-21 12:22:01.318000
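The cutoff_time DataFrame passed to dfs below is not shown above; here is a minimal sketch, assuming a plain pandas DataFrame with an instance-id column named after the target entity index ('user_id') and a 'time' column:
import pandas as pd

cutoff_time = pd.DataFrame({
    'user_id': [1001969, 1001969, 1001969, 1001969],
    'time': pd.to_datetime([
        '2019-12-21 12:11:06.595004',
        '2019-12-21 12:11:22.811008',
        '2019-12-21 12:20:24.405000',
        '2019-12-21 12:22:01.318000',
    ]),
})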
In DFS, I apply a five minute training window to each cutoff time.
fm, fd = ft.dfs(
    target_entity='users',
    entityset=es,
    agg_primitives=['num_unique'],
    trans_primitives=[],
    cutoff_time=cutoff_time,
    cutoff_time_in_index=True,
    training_window='5 minutes',
)
The cumulative count then works as expected, with the following output:
NUM_UNIQUE(total.origin) NUM_UNIQUE(total.event_name)
user_id time
1001969 2019-12-21 12:11:06.595004 NaN NaN
2019-12-21 12:11:22.811008 NaN NaN
2019-12-21 12:20:24.405000 1.0 1.0
2019-12-21 12:22:01.318000 2.0 2.0

Related

Groupby dates quarterly in a pandas dataframe and find the count of their occurrence

My Dataframe looks like
"dataframe_time"
INSERTED_UTC
0 2018-05-29
1 2018-05-22
2 2018-02-10
3 2018-04-30
4 2018-03-02
5 2018-11-26
6 2018-03-07
7 2018-05-12
8 2019-02-03
9 2018-08-03
10 2018-04-27
print(type(dataframe_time['INSERTED_UTC'].iloc[1]))
<class 'datetime.date'>
I am trying to group the dates together and find the count of their occurrence quarterly. Desired output:
Quarter Count
2018-03-31 3
2018-06-30 5
2018-09-30 1
2018-12-31 1
2019-03-31 1
2019-06-30 0
I am running the following command to group them together
dataframe_time['INSERTED_UTC'].groupby(pd.Grouper(freq='Q'))
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Int64Index'
First convert the dates to datetimes, then use DataFrame.resample with the on parameter to specify the datetime column:
dataframe_time.INSERTED_UTC = pd.to_datetime(dataframe_time.INSERTED_UTC)
df = dataframe_time.resample('Q', on='INSERTED_UTC').size().reset_index(name='Count')
Or your solution can be changed to:
df = (dataframe_time.groupby(pd.Grouper(freq='Q', key='INSERTED_UTC'))
                    .size()
                    .reset_index(name='Count'))
print (df)
INSERTED_UTC Count
0 2018-03-31 3
1 2018-06-30 5
2 2018-09-30 1
3 2018-12-31 1
4 2019-03-31 1
You can convert the dates to quarters by to_period('Q') and group by those:
df.INSERTED_UTC = pd.to_datetime(df.INSERTED_UTC)
df.groupby(df.INSERTED_UTC.dt.to_period('Q')).size()
You can also use value_counts:
df.INSERTED_UTC.dt.to_period('Q').value_counts()
Output:
INSERTED_UTC
2018Q1 3
2018Q2 5
2018Q3 1
2018Q4 1
2019Q1 1
Freq: Q-DEC, dtype: int64
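If you prefer the quarter-end dates from the desired output over Period labels, here is a minimal sketch (column names assumed) that reshapes the value_counts result:
# count per quarter, as Period labels
counts = df.INSERTED_UTC.dt.to_period('Q').value_counts().sort_index()
# turn the PeriodIndex into a 'Quarter' column and name the counts
out = counts.rename_axis('Quarter').reset_index(name='Count')
# convert each Period to its quarter-end date (e.g. 2018Q1 -> 2018-03-31)
out['Quarter'] = out['Quarter'].dt.to_timestamp(how='end').dt.normalize()
print(out)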

Pandas Multiple Conditional Mean With Group By

New to Python and pandas. I have a pandas DataFrame with customer data that includes the customer name, reporting month, and performance. I'm trying to get the first recorded performance for each customer.
CustomerName ReportingMonth Performance
0 7CGC 2019-12-01 1.175000
1 7CGC 2020-01-01 1.125000
2 ACC 2019-11-01 1.216802
3 ACBH 2019-05-01 0.916667
4 ACBH 2019-06-01 0.893333
5 AKC 2019-10-01 4.163636
6 AKC 2019-11-01 3.915215
Desired output
CustomerName ReportingMonth Performance
0 7CGC 2019-12-01 1.175000
1 ACC 2019-11-01 1.216802
2 ACBH 2019-05-01 0.916667
3 AKC 2019-10-01 4.163636
Use DataFrame.sort_values with GroupBy.first or DataFrame.drop_duplicates:
df.sort_values('ReportingMonth').groupby('CustomerName', as_index=False).first()
or
new_df = df.sort_values('ReportingMonth').drop_duplicates('CustomerName', keep='first')
print(new_df)
Output
CustomerName ReportingMonth Performance
3 ACBH 2019-05-01 0.916667
5 AKC 2019-10-01 4.163636
2 ACC 2019-11-01 1.216802
0 7CGC 2019-12-01 1.175000
If it is already sorted, you don't need to sort again.
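Another option, sketched here on the assumption that ReportingMonth is converted to a real datetime first, is to pick each group's earliest row by label with idxmin:
import pandas as pd

# ensure the month column is a datetime so min/idxmin compare chronologically
df['ReportingMonth'] = pd.to_datetime(df['ReportingMonth'])
# index labels of the earliest ReportingMonth per customer, then select those rows
first_rows = df.loc[df.groupby('CustomerName')['ReportingMonth'].idxmin()]
print(first_rows)
This avoids sorting the whole frame by date.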

How to compare two data frames based on difference in date

I have two data frames; each has an id column and a date column.
I want to find rows in both data frames that have the same id with a date difference of more than 2 days.
Normally it's helpful to include a dataframe so that the responder doesn't need to create it. :)
import pandas as pd
from datetime import timedelta
Create two dataframes:
df1 = pd.DataFrame(data={"id":[0,1,2,3,4], "date":["2019-01-01","2019-01-03","2019-01-05","2019-01-07","2019-01-09"]})
df1["date"] = pd.to_datetime(df1["date"])
df2 = pd.DataFrame(data={"id":[0,1,2,8,4], "date":["2019-01-02","2019-01-06","2019-01-09","2019-01-07","2019-01-10"]})
df2["date"] = pd.to_datetime(df2["date"])
They will look like this:
DF1
id date
0 0 2019-01-01
1 1 2019-01-03
2 2 2019-01-05
3 3 2019-01-07
4 4 2019-01-09
DF2
id date
0 0 2019-01-02
1 1 2019-01-06
2 2 2019-01-09
3 8 2019-01-07
4 4 2019-01-10
Merge the two dataframes on 'id' columns:
df_result = df1.merge(df2, on="id")
Resulting in:
id date_x date_y
0 0 2019-01-01 2019-01-02
1 1 2019-01-03 2019-01-06
2 2 2019-01-05 2019-01-09
3 4 2019-01-09 2019-01-10
Then subtract the two date columns and filter for differences greater than two days.
df_result[(df_result["date_y"] - df_result["date_x"]) > timedelta(days=2)]
id date_x date_y
1 1 2019-01-03 2019-01-06
2 2 2019-01-05 2019-01-09
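The question does not say which frame's date is later, so if the difference can go in either direction, a sketch using the absolute difference:
# keep pairs whose dates are more than 2 days apart, regardless of sign
df_result[(df_result["date_y"] - df_result["date_x"]).abs() > timedelta(days=2)]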

Why am I not able to merge two data frames on a column containing some non-similar entries

train.head()
date date_block_num shop_id item_id item_price item_cnt_day
0 02.01.2013 0 59 22154 999.00 1.0
1 03.01.2013 0 25 2552 899.00 1.0
2 05.01.2013 0 25 2552 899.00 -1.0
3 06.01.2013 0 25 2554 1709.05 1.0
4 15.01.2013 0 25 2555 1099.00 1.0
test.head()
ID shop_id item_id
0 0 5 5037
1 1 5 5320
2 2 5 5233
3 3 5 5232
4 4 5 5268
I want to add the item_price column to my test data frame from my train data frame, so I am trying to merge the two data frames on 'item_id'.
'item_id' contains almost 90% similar values in both data frames, but I am getting a weird result:
df = pd.merge(test[['item_id']], train[['item_price', 'item_id']], on='item_id', how='inner')
<class 'pandas.core.frame.DataFrame'>
Int64Index: 60732252 entries, 0 to 60732251
Data columns (total 2 columns):
item_id int64
item_price float64
dtypes: float64(1), int64(1)
memory usage: 1.4 GB
Can anybody please help me understand what is happening and how I can correct it?
In my opinion the problem is duplicates: merging on a key with repeated values produces a row for every pairing of those duplicates, which is why the result explodes to over 60 million rows.
One possible solution is to remove them:
test = test.drop_duplicates('item_id')
train= train.drop_duplicates('item_id')
Or add a helper column for the merge, so the n-th duplicate of each item_id in test is paired with the n-th duplicate in train:
test['g'] = test.groupby('item_id').cumcount()
train['g'] = train.groupby('item_id').cumcount()
df = pd.merge(test[['item_id', 'g']],
              train[['item_price', 'item_id', 'g']],
              on=['item_id', 'g']).drop('g', axis=1)
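If the goal is just to attach a single item_price per item_id to test, another sketch (assuming that averaging duplicate train prices is acceptable, which the question does not specify) aggregates before merging and keeps every test row with a left join:
# one price per item_id (mean of duplicates), then a left join so every test row is kept
prices = train.groupby('item_id', as_index=False)['item_price'].mean()
df = test.merge(prices, on='item_id', how='left')
Test rows whose item_id never appears in train end up with NaN for item_price.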

Apply a value to max values in a groupby

I have a DF like this:
ID Time
1 20:29
1 20:45
1 23:16
2 11:00
2 13:00
3 01:00
I want to create a new column that puts a 1 next to the largest time value within each ID grouping like so:
ID Time Value
1 20:29 0
1 20:45 0
1 23:16 1
2 11:00 0
2 13:00 1
3 01:00 1
I know the answer involves a groupby mechanism and have been fiddling around with something like:
df.groupby('ID')['Time'].max() = 1
The idea is to write an anonymous function that operates on each of your groups and feed this to your groupby using apply:
df['Value'] = df.groupby('ID', as_index=False).apply(lambda x: x.Time == max(x.Time)).values.astype(int)
Assuming that your 'Time' column is already datetime64, you want to group by the 'ID' column and then call transform, applying a lambda to create a series with an index aligned with your original df:
In [92]:
df['Value'] = df.groupby('ID')['Time'].transform(lambda x: (x == x.max())).dt.nanosecond
df
Out[92]:
ID Time Value
0 1 2015-11-20 20:29:00 0
1 1 2015-11-20 20:45:00 0
2 1 2015-11-20 23:16:00 1
3 2 2015-11-20 11:00:00 0
4 2 2015-11-20 13:00:00 1
5 3 2015-11-20 01:00:00 1
The dt.nanosecond call is because the dtype returned is a datetime for some reason rather than a boolean:
In [93]:
df.groupby('ID')['Time'].transform(lambda x: (x == x.max()))
Out[93]:
0 1970-01-01 00:00:00.000000000
1 1970-01-01 00:00:00.000000000
2 1970-01-01 00:00:00.000000001
3 1970-01-01 00:00:00.000000000
4 1970-01-01 00:00:00.000000001
5 1970-01-01 00:00:00.000000001
Name: Time, dtype: datetime64[ns]
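A simpler sketch that sidesteps the coercion entirely: compute the group maximum with the built-in 'max' transform, do the comparison outside the groupby, and cast the boolean to 0/1:
# group-wise maximum Time, aligned with the original rows
group_max = df.groupby('ID')['Time'].transform('max')
# 1 where the row holds its group's largest Time, else 0
df['Value'] = (df['Time'] == group_max).astype(int)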
