I have data for around 2 million active customers and around 2-5 years' worth of transaction data per customer. This data includes features such as which item the customer bought, which store they bought it from, the date they purchased that item, how much they bought, how much they paid, etc.
I need to predict which of our customers will shop in the next 2 weeks.
Right now my data is set up like this
item_a item_b item_c item_d customer_id visit
dates
6/01 1 0 0 0 cust_123 1
6/02 0 0 0 0 cust_123 0
6/03 0 1 0 0 cust_123 1
6/04 0 0 0 0 cust_123 0
6/05 1 0 0 0 cust_123 1
6/06 0 0 0 0 cust_123 0
6/07 0 0 0 0 cust_123 0
6/08 1 0 0 0 cust_123 1
6/01 0 0 0 0 cust_456 0
6/02 0 0 0 0 cust_456 0
6/03 0 0 0 0 cust_456 0
6/04 0 0 0 0 cust_456 0
6/05 1 0 0 0 cust_456 1
6/06 0 0 0 0 cust_456 0
6/07 0 0 0 0 cust_456 0
6/08 0 0 0 0 cust_456 0
6/01 0 0 0 0 cust_789 0
6/02 0 0 0 0 cust_789 0
6/03 0 0 0 0 cust_789 0
6/04 0 0 0 0 cust_789 0
6/05 0 0 0 0 cust_789 0
6/06 0 0 0 0 cust_789 0
6/07 0 0 0 0 cust_789 0
6/08 0 1 1 0 cust_789 1
Should I make the target variable something like
df['target_variable'] = 'no_purchase'
for cust in list(set(df['customer'])):
    df['target_variable'] = np.where(df['visit'] > 0, cust, df['target_variable'])
or should my visit feature be my target variable? If it's the latter, should I one-hot encode (OHE) all 2 million customers? If not, how should I set this up in Keras so that it classifies visits for all 2 million customers?
I think you should first understand your problem better -- it requires strong domain knowledge to model it correctly, and it can be modeled in many different ways. Below are just some examples:
Regression problem: given a customer's purchase record containing only relative dates, construct a sequence of gaps like [date2-date1, date3-date2, date4-date3, ...] from your data. For example:
[6, 7, 5, 13, ...] means a customer tends to buy things on a weekly or biweekly basis;
[24, 30, 33, ...] means a customer tends to buy things on a monthly basis.
If you organize the problem this way, all you need to do is predict the next number in a given sequence. You can easily generate such training data by:
randomly selecting a full sequence, say [a, b, c, d, e, f, ..., z];
randomly selecting a position to predict, say x;
picking the K (say K=6) preceding values [r, s, t, u, v, w] as your network input, and x as your network target.
Once this model has been trained, your ultimate task is easily resolved by checking whether the predicted gap places the customer's next purchase inside your two-week window. A sketch of this sampling appears below.
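For illustration, here is a minimal pandas/NumPy sketch of that sampling; the column names 'customer_id' and 'date' and the choice K=6 are my assumptions, not anything given in the question:
import numpy as np
import pandas as pd

def make_gap_samples(purchases, k=6):
    # Build (k preceding gaps, next gap) training pairs per customer.
    # 'customer_id' and 'date' are assumed column names of a long-format purchase log.
    X, y = [], []
    for _, grp in purchases.sort_values('date').groupby('customer_id'):
        gaps = grp['date'].drop_duplicates().diff().dt.days.dropna().to_numpy()
        for i in range(len(gaps) - k):
            X.append(gaps[i:i + k])   # network input: the k most recent gaps
            y.append(gaps[i + k])     # network target: the gap that follows them
    return np.array(X), np.array(y)

# Tiny example: one customer buying roughly weekly.
log = pd.DataFrame({
    'customer_id': ['cust_123'] * 10,
    'date': pd.to_datetime(['2019-01-01', '2019-01-08', '2019-01-14', '2019-01-22',
                            '2019-01-29', '2019-02-05', '2019-02-11', '2019-02-18',
                            '2019-02-26', '2019-03-05']),
})
X, y = make_gap_samples(log, k=6)
print(X)   # three windows of six gaps each
print(y)   # the gap that follows each window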
Classification problem: given a customer's purchase record of K months, predict how many purchases the customer will make in the next two months.
Again, you need to create training data from your raw data, but this time the target for a customer is how many items they purchased in months K+1 and K+2, and you can organize your K-month input record in your own way.
Please note that the number of items a customer purchases is a discrete number with a fairly small range. In fact, as in face-image-based age estimation, people often quantize the target into bins, e.g. 0-8, 9-16, 17-24, etc. You can do the same thing for your problem. Of course, you may also formulate this as a regression problem and predict the count directly.
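As a small illustration of that binning (my sketch; the counts and bin edges are made up, only the 0-8 / 9-16 / 17-24 scheme comes from the text above), np.digitize turns raw counts into class labels:
import numpy as np

# Hypothetical number of items bought in months K+1 and K+2, one value per customer.
future_counts = np.array([0, 3, 12, 25, 7])

bins = [0, 9, 17, 25]                       # bin edges: 0-8, 9-16, 17-24, 25+
target_class = np.digitize(future_counts, bins) - 1
print(target_class)                         # [0 0 1 3 0]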
Why do you need to know your problem better?
As you can see, you can come up with a number of problem formulations that all look reasonable at first glance, and it is very difficult to say which one is best.
It is worth noting the dependence between a problem set-up and its hidden premises (you may not notice them until you think about the problem carefully). For example, the regression set-up that predicts the gap until the next purchase implies that the number of items a customer purchases does not matter. That assumption may or may not be fair in your problem.
You may come up with a much simpler but more effective solution if you know your problem well.
In most problems like yours, you don't have to use deep learning, at least not as a first choice. Classic approaches may work better.
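As one example of such a classic approach (purely my sketch, not something prescribed by this answer), you could build simple recency/frequency features per customer and fit a logistic regression on the two-week label; every column name and date below is illustrative:
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Tiny illustrative purchase log; in practice this would be the full transaction table.
df = pd.DataFrame({
    'customer_id': ['cust_123', 'cust_123', 'cust_456', 'cust_789', 'cust_123'],
    'date': pd.to_datetime(['2019-05-01', '2019-05-20', '2019-05-28',
                            '2019-05-30', '2019-06-05']),
})

cutoff = pd.Timestamp('2019-06-01')              # end of the feature window
hist = df[df['date'] < cutoff]

feats = hist.groupby('customer_id')['date'].agg(
    recency=lambda d: (cutoff - d.max()).days,   # days since last purchase
    frequency='count',                           # purchases in the history window
)

# Label: did the customer purchase within the two weeks after the cutoff?
future = df[(df['date'] >= cutoff) & (df['date'] < cutoff + pd.Timedelta(days=14))]
feats['target'] = feats.index.isin(future['customer_id']).astype(int)

model = LogisticRegression().fit(feats[['recency', 'frequency']], feats['target'])
print(model.predict_proba(feats[['recency', 'frequency']])[:, 1])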
Related
I have extracted a table from a database and wish to do some topic analysis on some entries. I have created an empty matrix with unique topic names and I have duplicate rows because there are potentially multiple topics associated with each 'name' entry. Ultimately, I would like a dataframe that has 1's across the row where a topic was associated with it. I will then remove the 'topic label' column, and at some point remove duplicate rows. The actual dataframe is much larger, but here I am just showing an illustration.
Here is my data:
topic_label name Misconceptions Long-term health issues Reproductive disease Inadequate research Unconscious bias
0 Misconceptions When is menstrual bleeding too much? 0 0 0 0 0
1 Long-term health issues When is menstrual bleeding too much? 0 0 0 0 0
2 Reproductive disease 10% of reproductive age women have endometriosis 0 0 0 0 0
3 Inadequate research 10% of reproductive age women have endometriosis 0 0 0 0 0
4 Unconscious bias Male bias threatens women's health 0 0 0 0 0
And I would like it to look like this:
topic_label name Misconceptions Long-term health issues Reproductive disease Inadequate research Unconscious bias
0 Misconceptions When is menstrual bleeding too much? 1 1 0 0 0
1 Long-term health issues When is menstrual bleeding too much? 1 1 0 0 0
2 Reproductive disease 10% of reproductive age women have endometriosis 0 0 1 1 0
3 Inadequate research 10% of reproductive age women have endometriosis 0 0 1 1 0
4 Unconscious bias Male bias threatens women's health 0 0 0 0 1
I have tried to use .loc in a loop to first slice the data by name, and then assign the values (after setting name as index), but this doesn't work when a row is unique:
name_set = list(set(df['name']))
df = df.set_index('name')
for i in name_set:
    df.loc[i, list(df.loc[i]['topic_label'])] = 1
I feel like I am going round in circles here... is there a better way to do this?
One option is to use get_dummies to build the dummy variables for each topic_label, then call sum in groupby.transform to aggregate the dummy variables within each name:
cols = df['topic_label'].tolist()
out = df.drop(columns=cols).join(
    pd.get_dummies(df['topic_label'])
      .groupby(df['name'])
      .transform('sum')
      .reindex(df['topic_label'], axis=1)
)
print(out)
The above returns a new DataFrame out. If you want to update df instead, then you can use update:
df.update(
    pd.get_dummies(df['topic_label'])
      .groupby(df['name'])
      .transform('sum')
      .reindex(df['topic_label'], axis=1)
)
print(df)
Output:
topic_label name Misconceptions Long-term health issues Reproductive disease Inadequate research Unconscious bias
0 Misconceptions When is menstrual bleeding too much? 1 1 0 0 0
1 Long-term health issues When is menstrual bleeding too much? 1 1 0 0 0
2 Reproductive disease 10% of reproductive age women have endometriosis 0 0 1 1 0
3 Inadequate research 10% of reproductive age women have endometriosis 0 0 1 1 0
4 Unconscious bias Male bias threatens women's health 0 0 0 0 1
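For what it's worth, here is a hedged alternative sketch of the same idea using pd.crosstab; I rebuild only the two key columns from the question, so the pre-made zero columns are replaced rather than updated, and the indicator columns come out in alphabetical order:
import pandas as pd

# Reduced version of the question's data: just the label and name columns.
df = pd.DataFrame({
    'topic_label': ['Misconceptions', 'Long-term health issues', 'Reproductive disease',
                    'Inadequate research', 'Unconscious bias'],
    'name': ['When is menstrual bleeding too much?',
             'When is menstrual bleeding too much?',
             '10% of reproductive age women have endometriosis',
             '10% of reproductive age women have endometriosis',
             "Male bias threatens women's health"],
})

# crosstab counts topic occurrences per name; re-aligning those counts to the
# original rows gives one fully populated indicator row per input row.
counts = pd.crosstab(df['name'], df['topic_label'])
out = df.join(counts.reindex(df['name']).reset_index(drop=True))
print(out)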
I have a dataset like this,
sample = {'Theme': ['never give a ten','interaction speed','no feedback,premium'],
          'cat1': [0,0,0],
          'cat2': [0,0,0],
          'cat3': [0,0,0],
          'cat4': [0,0,0]
          }
pd.DataFrame(sample,columns = ['Theme','cat1','cat2','cat3','cat4'])
Theme cat1 cat2 cat3 cat4
0 never give a ten 0 0 0 0
1 interaction speed 0 0 0 0
2 no feedback,premium 0 0 0 0
Now I need to replace the values in the cat columns based on the value in Theme. If the Theme column contains 'never give a ten', set cat1 to 1; similarly, if it contains 'interaction speed', set cat2 to 1; if it contains 'no feedback', set cat3 to 1; and for 'premium', set cat4 to 1.
In this sample I have provided 4 categories, but in total I have 21. I could write an "if word in string" check 21 times for the 21 categories, but I am looking for an efficient way to write this as a function that loops over every row, applies the logic, and updates the corresponding columns. Can anyone help, please?
Thanks in advance.
It is possible to set the column names by category with Series.str.get_dummies; note that the column names come out sorted:
df1 = df['Theme'].str.get_dummies(',')
print (df1)
interaction speed never give a ten no feedback premium
0 0 1 0 0
1 1 0 0 0
2 0 0 1 1
If you need the first column in the output, add DataFrame.join:
df11 = df[['Theme']].join(df['Theme'].str.get_dummies(','))
print (df11)
Theme interaction speed never give a ten no feedback \
0 never give a ten 0 1 0
1 interaction speed 1 0 0
2 no feedback,premium 0 0 1
premium
0 0
1 0
2 1
If the order of columns is important, add DataFrame.reindex:
# remove possible duplicates while preserving the original order
cols = dict.fromkeys([y for x in df['Theme'] for y in x.split(',')]).keys()
df2 = df['Theme'].str.get_dummies(',').reindex(cols, axis=1)
print (df2)
never give a ten interaction speed no feedback premium
0 1 0 0 0
1 0 1 0 0
2 0 0 1 1
cols = dict.fromkeys([y for x in df['Theme'] for y in x.split(',')]).keys()
df2 = df[['Theme']].join(df['Theme'].str.get_dummies(',').reindex(cols, axis=1))
print (df2)
Theme never give a ten interaction speed no feedback \
0 never give a ten 1 0 0
1 interaction speed 0 1 0
2 no feedback,premium 0 0 1
premium
0 0
1 0
2 1
I have trained a neural network to predict the outcome (win/lose) of a hockey game based on a few metrics.
The data I have been feeding it looks like this:
Each row represents a team in one game, so two specific rows make a match.
Won/Lost Home Away metric2 metric3 metric4 team1 team2 team3 team4
1 1 0 10 10 10 1 0 0 0
0 0 1 10 10 10 0 1 0 0
1 1 0 10 10 10 0 0 1 0
0 0 1 10 10 10 0 0 0 1
The predictions from the NN look like this:
[0.12921564 0.87078434]
[0.63811845 0.3618816 ]
[5.8682327e-04 9.9941313e-01]
[0.97831124 0.02168871]
[0.04394475 0.9560553 ]
[0.76859254 0.23140742]
[0.45620263 0.54379743]
[0.01509337 0.9849066 ]
I believe I understand that the first column is for Lost (0) and the second is for Won (1),
but what I don't understand is: who won against whom?
I don't know what to make of these predictions; do they even mean anything to me this way?
According to the data set you show here, it seems that the network's output gives the probability of a team winning or losing a match depending on which team is hosting. If you want your network to give the probability of winning against a specific opponent under a given hosting situation, I think you should add one more feature to your data set that identifies the rival team in the match (and if hosting is not important to you, then you should remove the Home and Away columns).
Let us take the first two rows of your dataset:
Won/Lost Home Away metric2 metric3 metric4 team1 team2 team3 team4
1 1 0 10 10 10 1 0 0 0
0 0 1 10 10 10 0 1 0 0
#predictions
[0.12921564 0.87078434]
[0.63811845 0.3618816 ]
Team 1 played at home and won the match. The model's prediction aligns with this because it assigns a high probability to the second column, which is the probability of winning, as you mentioned.
Similarly, Team 2 played away and lost the match. The model's prediction aligns here as well!
You mentioned that two specific rows make up a match, but with the available information we cannot say who played against whom. It is just a model that predicts the probability of winning for a particular team independently.
EDIT:
Assuming that you have data like this!
gameID Won/Lost Home Away metric2 metric3 metric4 team1 team2 team3 team4
2017020001 1 1 0 10 10 10 1 0 0 0
2017020001 0 0 1 10 10 10 0 1 0 0
You could transform the data as follows, which can improve the model.
Won/Lost metric2 metric3 metric4 h_team1 h_team2 h_team3 h_team4 a_team1 a_team2 a_team3 a_team4
1 10 10 10 1 0 0 0 0 1 0 0
Note: the Won/Lost value would be for the home team, whose one-hot columns are prefixed with h_team.
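A minimal pandas sketch of that reshaping, assuming a gameID column and one-hot team columns as in the tables above (the frame below only carries metric2 and two teams to keep it short, and I keep both teams' metrics prefixed rather than a single shared copy):
import pandas as pd

# Two rows per game: one for the home team, one for the away team.
raw = pd.DataFrame({
    'gameID':   [2017020001, 2017020001],
    'Won/Lost': [1, 0],
    'Home':     [1, 0],
    'metric2':  [10, 10],
    'team1':    [1, 0],
    'team2':    [0, 1],
})

home = raw[raw['Home'] == 1].add_prefix('h_')
away = raw[raw['Home'] == 0].add_prefix('a_')

# One row per game; the label is the home team's Won/Lost value.
games = home.merge(away, left_on='h_gameID', right_on='a_gameID')
games = games.rename(columns={'h_Won/Lost': 'Won/Lost'})
games = games.drop(columns=['a_Won/Lost', 'h_Home', 'a_Home', 'h_gameID', 'a_gameID'])
print(games)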
I had a J program I wrote in 1985 (on vax vms). One section was creating a diagonal matrix from a vector.
a=(n,n)R1,nR0
b=In
a=bXa
Maybe it wasn't J but APL in ASCII, but these lines work in current J (with appropriate changes to the primitive functions). They do not work in APL (GNU APL, NARS2000, or ELI); I get a domain error on the last line.
Is there an easy way to do this without looping?
Your code is an ASCII transliteration of APL. The corresponding J code is:
a=.(n,n)$1,n$0
b=.i.n
a=.b*a
Try it online! However, no APL has major cell extension yet (it is being considered for Dyalog APL), which is required on the last line. You therefore need to specify that the scalars of the vector b should be multiplied with the rows of the matrix a, using bracket axis notation:
a←(n,n)⍴1,n⍴0
b←⍳n
a←b×[1]a
Try it online! Alternatively, you can use the rank operator (where available):
a←(n,n)⍴1,n⍴0
b←⍳n
a←b(×⍤0 1)a
Try it online!
A more elegant way to address diagonals is ⍉ with repeated axes:
n←5 ◊ z←(n,n)⍴0 ◊ (1 1⍉z)←⍳n ◊ z
1 0 0 0 0
0 2 0 0 0
0 0 3 0 0
0 0 0 4 0
0 0 0 0 5
Given an input vector X, the following works in all APLs (courtesy of @Adám in chat):
(2⍴S)⍴((2×S)⍴1,-S←⍴X)\X
And here's a place where you can run it online.
Here are my old, inefficient versions that use multiplication and the outer product (the latter causes the inefficiency):
((⍴Q)⍴X)×Q←P∘.=P←⍳⍴X
((⍴Q)⍴X)×Q←P P⍴1,(P←≢X)⍴0
Or another way:
(n∘.=n)×(2⍴⍴n)⍴n←⍳5
should give you the following in most APLs
1 0 0 0 0
0 2 0 0 0
0 0 3 0 0
0 0 0 4 0
0 0 0 0 5
This solution works in the old ISO APL:
a←(n,n)⍴v,(n,n)⍴0
I'm getting my feet wet with J and, to get the ball rolling, decided to write a function that:
takes an integer N;
spits out a table that follows this pattern:
(example for N = 4)
1
0 1
0 0 1
0 0 0 1
i.e. in each row the number of zeroes increases from 0 up to N - 1.
However, being a newbie, I'm stuck. My current labored (and incorrect) solution for the N = 4 case looks like this:
(4 # ,: 0 1) #~/"1 1 (1 ,.~/ i.4)
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1
And the problem with it is twofold:
it's not general enough and looks kinda ugly (parens and " usage);
trailing zeroes - as I understand it, all arrays in J must be rectangular (rows get padded), so in my case every row should be boxed.
Like that:
┌───────┐
│1 │
├───────┤
│0 1 │
├───────┤
│0 0 1 │
├───────┤
│0 0 0 1│
└───────┘
Or I should use strings (e.g. '0 0 1') which will be padded with spaces instead of zeroes.
So, what I'm kindly asking here is:
please provide an idiomatic J solution for this task with explanation;
criticize my attempt and point out how it could be finished.
Thanks in advance!
As with so many challenges in J, sometimes it is better to keep your focus on the result and find a different way to get there. In this case, what your initial approach is doing is creating an identity matrix. I would use
=/~@:i. 4
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1
You have correctly identified the issue with the trailing 0's and the fact that J will pad out with 0's to avoid ragged arrays. Boxing avoids this padding since each row is self contained.
So create your lists first. I would use overtake to get the extra 0's
4{.1
1 0 0 0
The next line uses 1: to return 1 as a verb and boxes the overtakes from 1 to 4
(>:@:i. <@:{."0 1:) 4
+-+---+-----+-------+
|1|1 0|1 0 0|1 0 0 0|
+-+---+-----+-------+
Since we want each list reversed and then made into a string, we add ":@:|.@: to the process.
(>:#:i. <#:":#:|.#:{."0 1:) 4
+-+---+-----+-------+
|1|0 1|0 0 1|0 0 0 1|
+-+---+-----+-------+
Then we unbox
>@:(>:@:i. <@:":@:|.@:{."0 1:) 4
1
0 1
0 0 1
0 0 0 1
I am not sure this is the way everyone would solve the problem, but it works.
An alternative solution does not use boxing; it uses the dyadic j. (complex) verb and the fact that
1j4 # 1
1 0 0 0 0
(1 j. 4) # 1
1 0 0 0 0
(1 #~ 1 j. ]) 4
1 0 0 0 0
So, I create a list for each integer in i. 4, then reverse them and make them into strings. Since they are now strings, the extra padding is done with blanks.
(1 ":#:|.#:#~ 1 j. ])"0#:i. 4
1
0 1
0 0 1
0 0 0 1
Taking this step by step to hopefully explain it a little better:
i.4
0 1 2 3
Which is then applied to (1 ":@:|.@:#~ 1 j. ]) an atom at a time, hence the use of "0
Breaking down what is going on within the parentheses: I first take the three rightmost elements, which form a fork.
( 1 j. ])"0@:i.4
1 1j1 1j2 1j3
Now, effectively that gives me
1 ":#:|.#:#~ 1 1j1 1j2 1j3
The middle tine of the fork becomes the verb acting on the two noun arguments.The ~ swaps the arguments. so it becomes equivalent to
1 1j1 1j2 1j3 ":#:|.#:# 1
which because of the way #: works is the same as
": |. 1 1j1 1j2 1j3 # 1
I haven't shown the results of these components because using the "0 on the fork changes how the arguments are sent to the middle tine and assembled later. I'm hoping that there is enough here that, with some hand waving, the explanation will suffice.
The jump from tacit to explicit can be a big one, so it may be a better exercise to write the same verb explicitly to see if it makes more sense.
lowerTriangle =: 3 : 0
rightArg=. i. y
complexCopy=. 1 j. rightArg
1 (":#:|.#:#~)"0 complexCopy
)
lowerTriangle 4
1
0 1
0 0 1
0 0 0 1
lowerTriangle 5
1
0 1
0 0 1
0 0 0 1
0 0 0 0 1
See what happens when you 'get the ball rolling'? I guess the thing about J is that the ball goes down a pretty steep slope no matter where you begin. Exciting, eh?