Calculate time-windowed profiles with featuretools DFS

I am having trouble understanding the cutoff_time concept.
What I am really looking for is calculating different features over a time window, say 60 days back (excluding the current transaction); the cutoff times look like hard-coded dates in the examples.
I am using a time index for each row (A_time below), and according to the docs here (what_is_cutoff_datetime):
The time index is defined as the first time that any information from a row can be used. If a cutoff time is specified when calculating features, rows that have a later value for the time index are automatically ignored.
So it is not clear whether, if I don't pass a cutoff time, the features will be calculated up to the time index value or not.
Here is my entity set definition:
es = ft.EntitySet('payment')
es = es.entity_from_dataframe(entity_id='tableA',
                              dataframe=tableA_dfpd,
                              index='paymentIndex',
                              time_index='A_time')
es.normalize_entity(base_entity_id='tableA',
                    new_entity_id='tableB',
                    index='B_index',
                    additional_variables=['B_x', 'B_time'],
                    make_time_index='B_time')
es.normalize_entity(base_entity_id='tableA',
                    new_entity_id='tableC',
                    index='C_index',
                    additional_variables=['C_x', 'C_date'],
                    make_time_index='C_date')
es.normalize_entity(base_entity_id='tableA',
                    new_entity_id='tableD',
                    index='D_index',
                    additional_variables=['D_x'],
                    make_time_index=False)
Entityset: payment
  Entities:
    tableA [Rows: 310083, Columns: 8]
    tableB [Rows: 30296, Columns: 3]
    tableC [Rows: 206565, Columns: 3]
    tableD [Rows: 18493, Columns: 2]
  Relationships:
    tableA.B_index -> tableB.B_index
    tableA.C_index -> tableC.C_index
    tableA.D_index -> tableD.D_index
How exactly can I do the window calculation? Do I need to pass the cutoff times to the dfs method or not?
I want all window calculations to be based on the A_time variable, over a 60-day window up to the current transaction, so the cutoff time for every transaction is actually the A_time value of that transaction, isn't it?

Thanks for the question. You can calculate features based on a time window by using a training window in DFS. You can also exclude transactions at the cutoff times by setting include_cutoff_time=False. I'll use this dataset of transactions to go through an example.
import featuretools as ft
df = ft.demo.load_mock_customer(return_single_table=True)
df = df[['transaction_id', 'transaction_time', 'customer_id', 'amount']]
df.sort_values(['customer_id', 'transaction_time'], inplace=True)
df.head()
transaction_id     transaction_time  customer_id  amount
           290  2014-01-01 00:44:25            1   21.35
           275  2014-01-01 00:45:30            1  108.11
           101  2014-01-01 00:46:35            1  112.53
            80  2014-01-01 00:47:40            1    6.29
           484  2014-01-01 00:48:45            1   47.95
First, we create an entity set for transactions and customers.
es = ft.EntitySet()
es.entity_from_dataframe(
    entity_id='transactions',
    index='transaction_id',
    time_index='transaction_time',
    dataframe=df,
)
es.normalize_entity(
    base_entity_id='transactions',
    new_entity_id='customers',
    index='customer_id',
)
es.add_last_time_indexes()
Entityset: None
  Entities:
    transactions [Rows: 500, Columns: 4]
    customers [Rows: 5, Columns: 2]
  Relationships:
    transactions.customer_id -> customers.customer_id
Then, we create a cutoff time at each transaction for each customer.
cutoff_time = df[['customer_id', 'transaction_time']].copy()
cutoff_time['time'] = cutoff_time.pop('transaction_time')
cutoff_time.head()
customer_id                 time
          1  2014-01-01 00:44:25
          1  2014-01-01 00:45:30
          1  2014-01-01 00:46:35
          1  2014-01-01 00:47:40
          1  2014-01-01 00:48:45
Now, we can run DFS using a training window to calculate features based on a time window. In this example, we'll set the training window to 1 hour. This will include all transactions within 1 hour before the cutoff time for each customer.
By default, transactions at the cutoff times are also included in the calculation. We can exclude those transactions by setting include_cutoff_time=False.
fm, fd = ft.dfs(
    target_entity='customers',
    entityset=es,
    cutoff_time=cutoff_time,
    include_cutoff_time=False,
    cutoff_time_in_index=True,
    training_window='1h',
    trans_primitives=[],
    agg_primitives=['sum'],
    verbose=True,
)
fm.sort_index().head()
                                 SUM(transactions.amount)
customer_id time
1           2014-01-01 00:44:25                      0.00
            2014-01-01 00:45:30                     21.35
            2014-01-01 00:46:35                    129.46
            2014-01-01 00:47:40                    241.99
            2014-01-01 00:48:45                    248.28
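Applied back to your entity set, a minimal sketch of the same idea (assuming tableA is the target entity, paymentIndex and A_time are its index and time index as defined above, and tableA_dfpd is the original dataframe; add your own primitives as needed) might look like this:
es.add_last_time_indexes()

# One cutoff time per payment: the cutoff is that row's own A_time.
cutoff_time = tableA_dfpd[['paymentIndex', 'A_time']].copy()
cutoff_time.columns = ['paymentIndex', 'time']

# 60-day training window, excluding the transaction at the cutoff time itself.
fm, fd = ft.dfs(
    entityset=es,
    target_entity='tableA',
    cutoff_time=cutoff_time,
    training_window='60 days',
    include_cutoff_time=False,
    verbose=True,
)
Each row of the feature matrix is then calculated using only the 60 days of history before that payment's A_time.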
If the cutoff times are not passed to DFS, then all transactions for each customer are included in the calculation. Let me know if this helps.

Related

Rearrange Dataframe with Datetime-Index to multiple columns

I have a pandas Dataframe with a Datetime-Index and just one column with a measured value at that time:
Index                 Value
2017-01-01 05:00:00     2.8
2017-01-01 05:15:00     3.2
I have data for several years now, one value every 15 minutes. I want to reorganize the df to this (I'm preparing the data to train a Neural Network, each line will be one input):
Index       0 days 05:00:00  0 days 05:15:00  ...  1 days 04:45:00
2017-01-01              2.8              3.2  ...              1.9
2017-01-02              ...
The fastest, most "pythonic" way I could find was this (with df being the original data, df_result the empty target df):
def order_data_by_days(row, df, log=None):
    for col in row.index:
        row[col] = df.loc[row.name + col].values[0]
    return row

# prepare df
df_result = pd.DataFrame(index=days_array, columns=times_array)
# fill df
df_result = df_result.apply(order_data_by_days, df=df, log=log, axis=1)
But this takes >20 seconds for 3.5 years of data (~120k data points)! Does anyone have an idea how I could do this a lot faster (I'm aiming at a couple of seconds)?
If not, I would try to do the transformation in some other language before the import.
I found a solution, if anyone else has this issue:
Step 1: create target df_result with index (dates, e.g. 2018-01-01, 2018-01-02, ...) as datetime.date and columns (times, e.g. 0 days 05:00:00, 0 days 05:15:00, ..., 1 days 04:45:00) as timedelta.
Step 2: use a for-loop to go through all times. Filter the original df each time using the between_time function, and write the filtered df into the target df_result:
for j in range(0, len(times_array)):
    this_time = get_str_from_timedelta(times_array[j], log)
    df_this_time = df.between_time(this_time, this_time)
    if df_result.empty:
        df_result = pd.DataFrame(index=df_this_time.index.date, columns=times_array)
    df_this_time.index = df_this_time.index.date
    if times_array[j] >= timedelta(days=1):
        df_this_time.index = df_this_time.index - timedelta(days=1)
    df_result[times_array[j]] = df_this_time[pv]
Note that in my case I checked if the times are actually from next day (timedelta(days=1)), since my "day" starts at 05:00 a.m. and lasts until 04:45 a.m. the next day. To make sure they end up in the same row of df_result (even though, technically, the date-index is wrong here), I use the last if.
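If anyone wants to avoid the explicit loop, here is a vectorized sketch of the same reshaping (assuming df has a DatetimeIndex and a single value column named 'Value', and that the "day" starts at 05:00 as described above): shift the index back five hours, split it into a date part and a time-of-day offset, and pivot.
import pandas as pd

# Shift so that 05:00 .. 04:45 of the next calendar day map to one "day".
shifted = df.index - pd.Timedelta(hours=5)
day = pd.to_datetime(shifted.date)   # midnight of each row's "day"
offset = df.index - day              # timedelta since that midnight (0 days 05:00:00 .. 1 days 04:45:00)

df_result = (
    df.assign(day=day.date, offset=offset)
      .pivot(index='day', columns='offset', values='Value')
)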

calculate average difference between dates using pyspark

I have a data frame that looks like this: user ID and dates of activity. I need to calculate the average difference between the dates using RDD functions (such as reduce and map) and not SQL.
The dates for each ID need to be sorted before calculating the difference, as I need the difference between each pair of consecutive dates.
ID        Date
 1  2020-09-03
 1  2020-09-03
 2  2020-09-02
 1  2020-09-04
 2  2020-09-06
 2  2020-09-16
The needed outcome for this example will be:
ID  average difference
 1                 0.5
 2                   7
thanks for helping!
You can use datediff with a window function to calculate the difference, then take an average.
lag is one of the window functions; it takes a value from the previous row within the window.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# define the window
w = Window.partitionBy('ID').orderBy('Date')

# datediff takes the date difference from the first arg to the second arg (first - second)
(df.withColumn('diff', F.datediff(F.col('Date'), F.lag('Date').over(w)))
   .groupby('ID')  # aggregate over ID
   .agg(F.avg(F.col('diff')).alias('average difference'))
)
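Since the question asks for RDD operations rather than the DataFrame API, here is a rough RDD-based sketch as well (assuming an RDD of (ID, 'YYYY-MM-DD') pairs; groupByKey is used for brevity even though it shuffles all dates per ID):
from datetime import date

def avg_consecutive_diff(dates):
    # Sort the dates for one ID and average the gaps between consecutive dates.
    ds = sorted(dates)
    gaps = [(b - a).days for a, b in zip(ds, ds[1:])]
    return sum(gaps) / len(gaps) if gaps else 0.0

result = (
    rdd.map(lambda row: (row[0], date.fromisoformat(row[1])))
       .groupByKey()
       .mapValues(avg_consecutive_diff)
)
For the sample data this gives (1, 0.5) and (2, 7.0), matching the expected outcome.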

How to map sales against purchases sequentially using python?

I have a transaction dataframe as under:
Item Date Code Qty Price Value
0 A 01-01-01 Buy 10 100.5 1005.0
1 A 02-01-01 Buy 5 120.0 600.0
2 A 03-01-01 Sell 12 125.0 1500.0
3 A 04-01-01 Buy 9 110.0 990.0
4 A 04-01-01 Sell 1 100.0 100.0
#and so on... there are a million rows with about thousand items (here just one item A)
What I want is to map each selling transaction against a purchase transaction sequentially, FIRST IN FIRST OUT: the purchase that was made first will be sold out first.
For this, I have added a new column bQty with the opening balance equal to the purchase quantity. Then I iterate through the dataframe for each sell transaction to set the sold quantity off against the purchase transactions before that date.
df['bQty'] = df[df['Code'] == 'Buy']['Qty']
for _, sell in df[df['Code'] == 'Sell'].iterrows():
    for i, buy in df[(df['Code'] == 'Buy') & (df['Date'] <= sell['Date'])].iterrows():
        # code to offset the sold quantity against this purchase
Now this requires me to go through the whole dataframe again and again for each sell transaction.
For 1000 records it takes about 10 seconds to complete, so we can assume that for a million records this approach will take a lot of time.
Is there any faster way to do this?
If you are only interested in the resulting final balance values per item, here is a fast way to calculate them:
Add two additional columns that contain the same absolute values as Qty and Value, but with a negative sign in those rows where the Code value is Sell. Then you can group by item and sum these values for each item, to get the remaining number of items and the money spent for them on balance.
sale = df.Code == 'Sell'
df['Qty_signed'] = df.Qty.copy()
df.loc[sale, 'Qty_signed'] *= -1
df['Value_signed'] = df.Value.copy()
df.loc[sale, 'Value_signed'] *= -1
qty_remaining = df.groupby('Item')['Qty_signed'].sum()
print(qty_remaining)
money_spent = df.groupby('Item')['Value_signed'].sum()
print(money_spent)
Output:
Item
A 11
Name: Qty_signed, dtype: int64
Item
A 995.0
Name: Value_signed, dtype: float64
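As a sanity check against the sample rows: item A buys 10 + 5 + 9 = 24 units and sells 12 + 1 = 13, leaving the 11 shown above; the buy values 1005 + 600 + 990 minus the sell values 1500 + 100 give the 995.0 net spend.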

How to join two dataframes for which column time values are within a certain range and are not datetime or timestamp objects?

I have two dataframes as shown below:
time browncarbon blackcarbon
181.7335 0.105270 NaN
181.3809 0.166545 0.001217
181.6197 0.071581 NaN
422 rows x 3 columns
start end toc
179.9989 180.0002 155.0
180.0002 180.0016 152.0
180.0016 180.0030 151.0
1364 rows x 3 columns
The first dataframe has a time column with instants every four minutes. The second dataframe has two time columns spaced every two minutes. These time columns do not start and end at the same time; however, they contain data collected over the same day. How could I make another dataframe containing:
time browncarbon blackcarbon toc
422 rows X 4 columns
There is a related answer on Stack Overflow, however, that is applicable only when the time columns are datetime or timestamp objects. The link is: How to join two dataframes for which column values are within a certain range?
Addendum 1: The multiple start and end rows that get encapsulated into one of the time rows should also correspond to one toc row, as they do right now; however, it should be the average of the multiple toc rows, which is not the case at present.
Addendum 2: Merging two pandas dataframes with complex conditions
We create an artificial key column to do an outer merge to get the cartesian product back (all matches between the rows). Then we filter all the rows where time falls in between the range with .query.
Note: I edited the value of one row so we can get a match (see row 0 in the example dataframes at the bottom).
df1.assign(key=1).merge(df2.assign(key=1), on='key', how='outer')\
   .query('(time >= start) & (time <= end)')\
   .drop(['key', 'start', 'end'], axis=1)
output
time browncarbon blackcarbon toc
1 180.0008 0.10527 NaN 152.0
Example dataframes used:
df1:
time browncarbon blackcarbon
0 180.0008 0.105270 NaN
1 181.3809 0.166545 0.001217
2 181.6197 0.071581 NaN
df2:
start end toc
0 179.9989 180.0002 155.0
1 180.0002 180.0016 152.0
2 180.0016 180.0030 151.0
Since the start and end intervals are mutually exclusive, we may be able to create new columns in df2 so that they contain all the integer values between floor(start) and floor(end). Then add another column to df1 as floor(time) and take a left outer join of df1 and df2. I think that should do it, except that you may have to remove NaN values and extra columns if required. If you send me the csv files, I may be able to send you the script. I hope I answered your question.
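A rough sketch of that idea (column names taken from the example dataframes above; the expansion step also covers intervals that cross an integer boundary, such as 179.9989 to 180.0002):
import numpy as np
import pandas as pd

# Expand each df2 row to one key per integer between floor(start) and floor(end).
df2_keys = df2.assign(key=[list(range(int(np.floor(s)), int(np.floor(e)) + 1))
                           for s, e in zip(df2['start'], df2['end'])]).explode('key')
df2_keys['key'] = df2_keys['key'].astype(int)

# Left join on floor(time), then keep only rows where time falls inside the interval.
merged = df1.assign(key=np.floor(df1['time']).astype(int)).merge(df2_keys, on='key', how='left')
result = (merged[(merged['time'] >= merged['start']) & (merged['time'] <= merged['end'])]
          .drop(columns=['key', 'start', 'end']))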
Perhaps you could just convert your columns to Timestamps and then use the answer in the other question you linked:
from pandas import Timestamp
from dateutil.relativedelta import relativedelta as rd

def to_timestamp(x):
    return Timestamp(2000, 1, 1) + rd(days=x)

df['start_time'] = df.start.apply(to_timestamp)
df['end_time'] = df.end.apply(to_timestamp)
Your 2nd data frame is too short, so it wouldn't reflect a meaningful merge. So I modified it a little:
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'start': [179.9989, 180.0002, 180.0016, 181.3, 181.5, 181.7],
                    'end': [180.0002, 180.0016, 180.003, 181.5, 185.7, 181.8],
                    'toc': [155.0, 152.0, 151.0, 150.0, 149.0, 148.0]})

df1['Rank'] = np.arange(len(df1))
new_df = pd.merge_asof(df1.sort_values('time'), df2,
                       left_on='time',
                       right_on='start')
gives you:
time browncarbon blackcarbon Rank start end toc
0 181.3809 0.166545 0.001217 1 181.3 181.5 150.0
1 181.6197 0.071581 NaN 2 181.5 185.7 149.0
2 181.7335 0.105270 NaN 0 181.7 181.8 148.0
from which you can drop the extra columns and sort_values on Rank. For example:
new_df.sort_values('Rank').drop(['Rank','start','end'], axis=1)
gives:
time browncarbon blackcarbon toc
2 181.7335 0.105270 NaN 148.0
0 181.3809 0.166545 0.001217 150.0
1 181.6197 0.071581 NaN 149.0

Pyspark: How do I get today's score and 30 day avg score in a single row

I have a use case where I want to get the rank for today as well as the 30-day average as a column. The data has 30 days of data for a particular Id and Type. The data looks like:
Id Type checkInDate avgrank
1 ALONE 2019-04-24 1.333333
1 ALONE 2019-03-31 34.057471
2 ALONE 2019-04-17 1.660842
1 TOGETHER 2019-04-13 19.500000
1 TOGETHER 2019-04-08 5.481203
2 ALONE 2019-03-29 122.449156
3 ALONE 2019-04-07 3.375000
1 TOGETHER 2019-04-01 49.179719
5 TOGETHER 2019-04-17 1.391753
2 ALONE 2019-04-22 3.916667
1 ALONE 2019-04-15 2.459151
As my result I want to have output like
Id Type TodayAvg 30DayAvg
1 ALONE 30.0 9.333333
1 TOGETHER 1.0 34.057471
2 ALONE 7.8 99.660842
2 TOGETHER 3 19.500000
.
.
The way I think I can achieve it is by having 2 dataframes, one filtering on today's date and the second doing an average over 30 days, and then joining the two dataframes on Id and Type:
rank = glueContext.create_dynamic_frame.from_catalog(database="testing", table_name="rank", transformation_ctx="rank")
filtered_rank = Filter.apply(frame=rank, f=lambda x: (x["checkInDate"] == curr_dt))

rank_avg = glueContext.create_dynamic_frame.from_catalog(database="testing", table_name="rank", transformation_ctx="rank_avg")
rank_avg_f = rank_avg.groupBy("id", "type").agg(F.mean("avgrank"))

rank_join = filtered_rank.join(rank_avg_f, ["id", "type"], how='inner')
Is there a simpler way to do it i.e. without reading the dataframe twice?
You can convert the dynamic frame to an Apache Spark data frame and perform regular SQL.
Check the documentation for toDF() and Spark SQL.
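A minimal sketch of that suggestion, reusing the names from the question (rank is the DynamicFrame and curr_dt holds today's date) and using conditional aggregation so the source is read only once:
from pyspark.sql import functions as F

# Convert the Glue DynamicFrame to a Spark DataFrame once.
df = rank.toDF()

result = df.groupBy('Id', 'Type').agg(
    # average of avgrank for today's rows only (when() leaves other rows null, which avg ignores)
    F.avg(F.when(F.col('checkInDate') == F.lit(curr_dt), F.col('avgrank'))).alias('TodayAvg'),
    # average over all 30 days of data
    F.avg('avgrank').alias('30DayAvg'),
)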
