Update marketing campaign paths based on time - apache-spark

I am using PySpark to process the following dataframe so that it can feed a marketing attribution model:
user_id     timestamp         activity  campaign           event_name
akalsds124  2022-01-01 10:00  click     Holidays Campaign  NULL
akalsds124  2021-12-31 09:00  click     Holidays Campaign  NULL
akalsds124  2022-01-13 15:59  click     X Campaign         NULL
akalsds124  2022-01-10 16:32  click     Super Campaign     NULL
akalsds124  2022-01-05 22:12  click     Holidays Campaign  NULL
akalsds124  2022-01-30 20:55  event     NULL               purchase
akalsds124  2022-01-30 22:10  event     NULL               purchase
akalsds124  2022-01-31 10:13  event     NULL               purchase
akalsds124  2022-02-03 04:55  click     T8 Campaign        NULL
akalsds124  2022-02-07 17:30  click     Y Campaign         NULL
akalsds124  2022-02-12 22:37  event     NULL               purchase
akalsds124  2022-03-31 18:19  click     U9 Campaign        NULL
akalsds124  2022-04-02 23:08  click     II Campaign        NULL
akalsds124  2022-03-02 07:00  click     T8 Campaign        NULL
ijnbmshs33  2022-06-03 17:01  click     Mega Campaign      NULL
ijnbmshs33  2022-05-03 10:31  click     New Campaign       NULL
ijnbmshs33  2022-05-20 17:01  click     Mega Campaign      NULL
An event is an interaction inside the app (e.g. a purchase, a login, etc.), and a click activity is an ad click made by the user.
I need to build each user's path of campaign touchpoints as a list. Each list must include only the touchpoints the user interacted with in the 30 days up to and including the purchase (the purchase date itself counts).
Paths that did not lead to a purchase must be closed once 30 days have passed (the last day of the 30-day window still counts). The order of the touchpoints matters, and duplicates must not be removed.
The output should be like this:
user_id     path                                                                 converted  total_conversions
akalsds124  [Holidays Campaign, Holidays Campaign, Super Campaign, X Campaign]  1          2
akalsds124  [Holidays Campaign, Super Campaign, X Campaign]                     1          1
akalsds124  [T8 Campaign, Y Campaign]                                           1          1
akalsds124  [T8 Campaign, U9 Campaign]                                          0          0
akalsds124  [II Campaign]                                                       0          0
ijnbmshs33  [New Campaign, Mega Campaign]                                       0          0
ijnbmshs33  [Mega Campaign]                                                     0          0
You can create the dataframe by using this code:
df = spark.createDataFrame(
    [('akalsds124', '2022-01-01 10:00', 'click', 'Holidays Campaign', 'NULL'),
     ('akalsds124', '2021-12-31 09:00', 'click', 'Holidays Campaign', 'NULL'),
     ('akalsds124', '2022-01-13 15:59', 'click', 'X Campaign', 'NULL'),
     ('akalsds124', '2022-01-10 16:32', 'click', 'Super Campaign', 'NULL'),
     ('akalsds124', '2022-01-05 22:12', 'click', 'Holidays Campaign', 'NULL'),
     ('akalsds124', '2022-01-30 20:55', 'event', 'NULL', 'purchase'),
     ('akalsds124', '2022-01-30 22:10', 'event', 'NULL', 'purchase'),
     ('akalsds124', '2022-01-31 10:13', 'event', 'NULL', 'purchase'),
     ('akalsds124', '2022-02-03 04:55', 'click', 'T8 Campaign', 'NULL'),
     ('akalsds124', '2022-02-07 17:30', 'click', 'Y Campaign', 'NULL'),
     ('akalsds124', '2022-02-12 22:37', 'event', 'NULL', 'purchase'),
     ('akalsds124', '2022-03-31 18:19', 'click', 'U9 Campaign', 'NULL'),
     ('akalsds124', '2022-04-02 23:08', 'click', 'II Campaign', 'NULL'),
     ('akalsds124', '2022-03-02 07:00', 'click', 'T8 Campaign', 'NULL'),
     ('ijnbmshs33', '2022-06-03 17:01', 'click', 'Mega Campaign', 'NULL'),
     ('ijnbmshs33', '2022-05-03 10:31', 'click', 'New Campaign', 'NULL'),
     ('ijnbmshs33', '2022-05-20 17:01', 'click', 'Mega Campaign', 'NULL')],
    ['user_id', 'timestamp', 'activity', 'campaign', 'event_name']
)

I think the trick is using the collect_list function over a bounded window; the code below could be the first part of an answer:
from pyspark.sql import functions as F, Window as W

# 30 days expressed in seconds, as a range-based window per user
window = W.partitionBy('user_id').orderBy('unixTime').rangeBetween(-3600 * 24 * 30, 0)

path_df = (
    df
    .withColumn('timestamp', F.col('timestamp').cast('timestamp'))
    .withColumn('unixTime', F.unix_timestamp('timestamp'))
    # campaign holds the literal string 'NULL' on purchase rows; make it a
    # real null so collect_list skips it
    .withColumn('campaign', F.when(F.col('campaign') != 'NULL', F.col('campaign')))
    .withColumn('pathList', F.collect_list('campaign').over(window))
    .filter(F.col('event_name') == 'purchase')
)
path_df.sort('timestamp').show(truncate=False)
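
Building on that path_df, a minimal sketch of the converted half of the desired output collapses purchases that share an identical path into one row and counts them. The names converted_paths and path are mine, chosen to mirror the expected output, and grouping on the path array is an assumption that happens to hold for this sample; a production version would more likely group on an explicit purchase-session key. The converted = 0 rows, where a path is closed 30 days after its last click, still need separate logic that is not shown here.

# sketch: collapse purchases with identical 30-day paths and count them
converted_paths = (
    path_df
    .withColumnRenamed('pathList', 'path')
    # assumption: purchases that see the exact same path are one journey
    .groupBy('user_id', 'path')
    .agg(F.count('*').alias('total_conversions'))
    .withColumn('converted', F.lit(1))
)
converted_paths.show(truncate=False)

On the sample data this yields the four converted = 1 rows of the expected output; for instance, the two purchases on 2022-01-30 both see [Holidays Campaign, Holidays Campaign, Super Campaign, X Campaign] and collapse into one row with total_conversions = 2.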

Related

How to find the first consecutive days of a dataset in SQL?

I'm trying to select only the first run of consecutive dates in SQL. This is the initial table I have:
Campaign           Code  Current Date  Campaign Start Date  Days Since Campaign Launch  Next Date   Next date - Current Date
domestic campaign  1     10/01/2022    10/01/2022           0                           11/01/2022  1
domestic campaign  1     11/01/2022    10/01/2022           1                           12/01/2022  1
domestic campaign  1     12/01/2022    10/01/2022           2                           13/01/2022  1
domestic campaign  1     13/01/2022    10/01/2022           3                           14/01/2022  1
domestic campaign  1     14/01/2022    10/01/2022           4                           15/01/2022  1
domestic campaign  1     15/01/2022    10/01/2022           5                           16/01/2022  1
domestic campaign  1     16/01/2022    10/01/2022           6                           17/01/2022  1
domestic campaign  1     17/01/2022    10/01/2022           7                           18/01/2022  1
domestic campaign  1     18/01/2022    10/01/2022           8                           19/01/2022  1
domestic campaign  1     19/01/2022    10/01/2022           9                           30/01/2022  11
domestic campaign  1     30/01/2022    10/01/2022           20                          31/01/2022  1
domestic campaign  1     31/01/2022    10/01/2022           21                          01/02/2022  1
domestic campaign  1     01/02/2022    10/01/2022           22                          19/05/2022  107
domestic campaign  1     19/05/2022    10/01/2022           129                         20/05/2022  1
And I am looking to select only the first consecutive dates which would return the following:
Campaign           Code  Current Date  Campaign Start Date  Days Since Campaign Launch  Next Date  Next date - Current Date
domestic campaign  1     44571         44571                0                           44572      1
domestic campaign  1     44572         44571                1                           44573      1
domestic campaign  1     44573         44571                2                           44574      1
domestic campaign  1     44574         44571                3                           44575      1
domestic campaign  1     44575         44571                4                           44576      1
domestic campaign  1     44576         44571                5                           44577      1
domestic campaign  1     44577         44571                6                           44578      1
domestic campaign  1     44578         44571                7                           44579      1
domestic campaign  1     44579         44571                8                           44580      1
Any help is much appreciated.
You can subtract a ROW_NUMBER, taken as days, from Next Date to build a group key that is constant across consecutive dates per campaign code, then use DENSE_RANK to keep only the first consecutive group for each campaign code:
WITH get_groups AS (
    SELECT *,
           DATE_ADD('day', -ROW_NUMBER() OVER (PARTITION BY CampaignCode ORDER BY NextDate),
                    NextDate) AS grp
    FROM table_name
),
get_ranks AS (
    SELECT *,
           DENSE_RANK() OVER (PARTITION BY CampaignCode ORDER BY grp) AS rnk
    FROM get_groups
)
SELECT CampaignCode, CurrentDate, CampaignStartDate,
       DaysSinceCampaign, NextDate, Nextdate_CurrentDate
FROM get_ranks
WHERE rnk = 1;
You can search for "gaps and islands problem in SQL" for more background on this class of problem.
Side note: Days Since Campaign, Campaign Start Date and Next date - Current Date can all be derived from the other columns, so you don't need to store them in the table.
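
If it helps to see the trick outside SQL, here is a toy pandas sketch of the same idea (the dates below are made up for illustration, not the asker's table):

import pandas as pd

# toy data: a run of three consecutive dates, then two, then a lone date
s = pd.Series(pd.to_datetime(['2022-01-10', '2022-01-11', '2022-01-12',
                              '2022-01-30', '2022-01-31', '2022-05-19']))

# subtracting a running row number (in days) from each date gives a key
# that stays constant within every run of consecutive dates
grp = s - pd.to_timedelta(range(len(s)), unit='D')

# keeping the rows that share the first key mirrors the rnk = 1 filter
first_run = s[grp.eq(grp.iloc[0])]
print(first_run)  # rows 0-2: 2022-01-10, 2022-01-11, 2022-01-12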

Fill date into columns

I have this table:

user_id  date        days_since_install
001      01-01-2021  0
001      02-01-2021  1
001      02-01-2021  2
I need to check, grouping by user_id, whether the days_since_install column contains a 1; if so, fill the retention_1d column with True, otherwise with False.
The resulting table should look like this:
user_id  retention_1d
001      True
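
Taking the question literally, a minimal pandas sketch using just the columns shown above is a per-user any() check on days_since_install:

import pandas as pd

df = pd.DataFrame({'user_id': ['001', '001', '001'],
                   'date': ['01-01-2021', '02-01-2021', '02-01-2021'],
                   'days_since_install': [0, 1, 2]})

# retention_1d is True for a user if any row has days_since_install == 1
retention = (df['days_since_install'].eq(1)
               .groupby(df['user_id']).any()
               .rename('retention_1d')
               .reset_index())
print(retention)
#   user_id  retention_1d
# 0     001          True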
You can use groupby.first to get the first install date per group, then map to assign it per user_id:
# get first install value (if you have duplicates you would need to get the min)
d = df[df['event_type'].eq('install')].groupby(df['user_id'])['date'].first()
# map the values per user_id
df['install'] = df['user_id'].map(d)
output:
user_id event_type date install
0 1 install 01-01-2021 01-01-2021
1 1 login 02-01-2021 01-01-2021
2 1 login 04-01-2021 01-01-2021
As a one-liner:
df['install'] = df['user_id'].map(
    df[df['event_type'].eq('install')].groupby(df['user_id'])['date'].first())
Use Series.map with a Series built from the install rows, deduplicated by user_id:
df['install'] = (df['user_id'].map(df[df['event_type'].eq('install')]
.drop_duplicates('user_id')
.set_index('user_id')['date']))
print (df)
user_id event_type date install
0 1 install 01-01-2021 01-01-2021
1 1 login 02-01-2021 01-01-2021
2 1 login 04-01-2021 01-01-2021
If the same id can install multiple times, use groupby + ffill:
(df
 .assign(install=df['date'].where(df['event_type'] == 'install'))
 .assign(install=lambda x: x.groupby('user_id')['install'].ffill())
)
output:
user_id event_type date install
0 1 install 01-01-2021 01-01-2021
1 1 login 02-01-2021 01-01-2021
2 1 login 04-01-2021 01-01-2021

How can I extract data in Python with the same value of fields using pandas

I have a dataset with fields id, time, date, name, etc. I want to extract the rows that share the same id and date. How can I do that?
For example
id time date
1 16:00 03/05/2020
2 16:00 03/05/2020
1 17:00 03/05/2020
1 16:00 04/05/2020
2 16:00 04/05/2020
Now I want to fetch:
1 16:00 03/05/2020
1 17:00 03/05/2020
You can use groupby and filter:
df.groupby(['id', 'date']).filter(lambda s: len(s) > 1)
id time date
0 1 16:00 03/05/2020
2 1 17:00 03/05/2020
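
As a side note, DataFrame.duplicated with keep=False returns the same rows without the per-group Python call that filter makes, which is usually faster on large frames:

import pandas as pd

df = pd.DataFrame({'id': [1, 2, 1, 1, 2],
                   'time': ['16:00', '16:00', '17:00', '16:00', '16:00'],
                   'date': ['03/05/2020', '03/05/2020', '03/05/2020',
                            '04/05/2020', '04/05/2020']})

# keep=False marks every row whose (id, date) pair occurs more than once
print(df[df.duplicated(['id', 'date'], keep=False)])
#    id   time        date
# 0   1  16:00  03/05/2020
# 2   1  17:00  03/05/2020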

Excel - Vlookup with results filters based on closest date before

I've got two tables in Excel. The first has the total spent per customer on an exact date (some customers have two or more rows):
Customer_ID Total_$ date
12334 123455 12.12.2017
14446 222222 12.12.2017
10551 333333 12.12.2017
10691 444444 12.12.2017
10295 432432 12.12.2017
10295 132432 10.12.2017
10195 552423 22.12.2017
and the second has transactions per category:
Customer_ID Category date
12334 1 12.12.2017
14446 2 12.12.2017
10551 4 12.12.2017
10691 4 12.12.2017
10295 4 12.12.2017
10295 4 12.12.2017
10195 4 12.12.2017
What I need is to match the two tables so that, for each row of table 1, I see the last transaction before that row's date in each of the 7 categories, together with the count of transactions per category (also restricted to before the date in table 1).
So, for example, the result would say that on 14.02.2017 client 123424 spent $1000, bought category 1 twice with the last transaction on 12.02.2017, bought category 2 once on 13.02.2017, and had no deals in the other 5 categories (maybe it is easier to show this in two separate tables). Transactions from table 2 that happen after the date in table 1 must not be matched, so they should not be visible.
Any thoughts on that? Maybe I need another tool to do it properly?
Thanks in advance.
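
Since the question asks whether another tool would fit better: this "latest transaction at or before a date" lookup is an as-of join, which pandas provides as pd.merge_asof. A rough sketch with hypothetical frame names spend and trans built from fragments of the two tables above:

import pandas as pd

spend = pd.DataFrame({'Customer_ID': [10295, 10295],
                      'Total_$': [432432, 132432],
                      'date': pd.to_datetime(['2017-12-12', '2017-12-10'])})
trans = pd.DataFrame({'Customer_ID': [10295, 10295],
                      'Category': [4, 4],
                      'date': pd.to_datetime(['2017-12-12', '2017-12-10'])})

# merge_asof needs both frames sorted on the join key
spend = spend.sort_values('date')
trans = trans.sort_values('date')

# for each spend row, take the latest transaction at or before its date,
# matched within the same Customer_ID (direction='backward' is the default)
last_txn = pd.merge_asof(spend, trans, on='date', by='Customer_ID')
print(last_txn)

Per-category results would come from running this once per category, and the transaction counts from a cumulative count restricted to transactions before each date.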

Oracle Date Time string format

I have the following Oracle table, where the datetime is a single column stored as a string:
f_ID f_type f_date
1001 A 3/30/14 12:20:00 PM
1001 B 3/30/14 10:20:00 AM
1002 A 2/3/14 11:0:00 AM
1002 B 2/3/14 9:00:00 AM
1003 A 2/13/14 10:00:00 AM
1003 B 12/13/14 10:00:00 AM
1111 B 12/13/14 10:00:00 AM
I wish to calculate the average time taken for all shipments that have count > 1. The time difference for shipment 1001 is 2 hours, for 1002 it is 2 hours, and for 1003 it is 10 months (303 days x 24 = 7272 hours). 1111 has count = 1, so it is excluded from the average.
The average should therefore be (2 + 2 + 7272) / 3 = 2425.33 hours.
How do I query that?
This one should work, with two fixes worth noting: the second condition must reference alias b (the original a.f_type = 'A' AND a.f_type = 'B' can never be true), and since f_date is stored as a string it has to go through TO_DATE before the subtraction:
WITH t AS (
    SELECT f_ID,
           -- format model assumes strings like '3/30/14 12:20:00 PM'
           ABS(TO_DATE(b.f_date, 'MM/DD/RR HH:MI:SS AM')
             - TO_DATE(a.f_date, 'MM/DD/RR HH:MI:SS AM')) * 24 AS duration
    FROM my_table a
    JOIN my_table b USING (f_ID)
    WHERE a.f_type = 'A'
      AND b.f_type = 'B'
)
SELECT AVG(duration)
FROM t;
The inner join drops any f_ID that lacks both an A and a B row, which is what excludes 1111 from the average.
