model implicit and explicit behavioral data for recommendation engine - apache-spark

I have the following user behavior data:
1. like
2. dislike
3. rating
4. product viewed
5. product purchased
Spark MLlib supports implicit behavioral data with a confidence score of 0 or 1 (ref: http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html).
For example, if user 1 viewed product A, the data would be modeled as
1,A,1 (userId, productId, binary confidence score)
But given the nature of the behaviors, a like signals stronger confidence than a view, and a purchase signals stronger confidence than a view.
How can one model the data based on the type of behavior?

Actually implicit data does not have to be 0 or 1. It means that the values are treated like a confidence or strength of association, rather than a rating. You can simply model actions that show a higher association between user and item as having a higher confidence. A like is stronger than a view, and a purchase is stronger than a like.
In fact, negative confidence can be fit into this framework (and I know MLlib implements that). A dislike can mean a negative score.
What the values are is up to you to tune, really. I think it's reasonable to pick values that correspond to relative frequency, if you have no better idea. For example, if there are generally 50x more page views than likes, maybe a like's value is 50x that of a page view.
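As a rough illustration of the relative-frequency idea, here is a minimal Python sketch. The function names, the event counts and the choice to give a dislike a negative sign are all assumptions made up for this example; ratings are left out because they would need their own scaling.

```python
from collections import defaultdict


def confidence_weights(event_counts, base_event="view"):
    """Weight each event type by how rare it is relative to the base event.

    Assumption: a rarer event signals a proportionally stronger association.
    """
    base = event_counts[base_event]
    return {event: base / count for event, count in event_counts.items() if count > 0}


def to_confidence_triples(events, weights, negative_events=("dislike",)):
    """Collapse raw (user, item, event) rows into (user, item, confidence) triples."""
    scores = defaultdict(float)
    for user, item, event in events:
        # Assumption: negative events subtract confidence instead of adding it.
        sign = -1.0 if event in negative_events else 1.0
        scores[(user, item)] += sign * weights[event]
    return [(user, item, score) for (user, item), score in sorted(scores.items())]


# Hypothetical global counts: 50x more views than likes, 100x more than purchases.
counts = {"view": 5000, "like": 100, "dislike": 100, "purchase": 50}
weights = confidence_weights(counts)

events = [(1, "A", "view"), (1, "A", "like"), (2, "A", "dislike"), (2, "B", "purchase")]
triples = to_confidence_triples(events, weights)
```

The resulting triples have the same (userId, productId, score) shape as the binary example in the question, so they could be fed to an implicit-feedback trainer in its place. The weights are only a starting point and would still need tuning.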

Related

Testing significance of Strauss parameters in mppm model

I have a follow up question from my previous post.
Upon creating mppm models like these:
Str <- hyperframe(str=with(simba, Strauss(mean(nndist(Points)))))
fit0 <- mppm(Points ~ group, simba)
fit1 <- mppm(Points ~ group, simba, interaction=Str,
iformula = ~str + str:id)
Using anova.mppm to run a likelihood ratio test shows that the interaction is highly significant as a whole, but I would also like to test:
whether each individual id shows significant regularity.
whether some groups of ids show significantly stronger inhibition than other groups, for example, whether ids 1-7 are significantly more regular than ids 8-10.
perform pairwise comparisons of regularity between different ids.
I am aware I could build separate ppm models for each id to test for significant regularity in each id, but I am not sure this is the best approach. Also, I do not think the "summary output" with the p-values for each Strauss interaction parameter can be used for pairwise comparisons other than to the reference level.
Any advice is greatly appreciated.
Thank you!
First let me explain that, for Gibbs models, the likelihood is intractable, so anova.mppm performs the adjusted composite likelihood ratio test, not the likelihood ratio test. However, you can essentially treat this as if it were the likelihood ratio test based on deviance differences.
whether each individual id shows significant regularity
I am aware I could build separate ppm models for each id to test for significant regularity in each id, but I am not sure this is the best approach.
This is appropriate. Use ppm to fit a Strauss model to an individual point pattern, and use anova.ppm to test whether the Strauss interaction is statistically significant.
whether some groups of ids show significantly stronger inhibition than other groups, for example, whether ids 1-7 are significantly more regular than ids 8-10.
Introduce a new categorical variable (factor) f, say, that separates the two groups that you want to compare. In your model, add the term f:str to the interaction formula; this gives you the alternative hypothesis. The null and alternative models are identical except that the alternative includes the term f:str in the interaction formula. Now apply anova.mppm. Like all analyses of variance, this performs a two-sided test. For the one-sided test, inspect the sign of the coefficient of f:str in the fitted alternative model. If it has the sign that you wanted, report it as significant at the same p-value. Otherwise, report it as non-significant.
perform pairwise comparisons of regularity between different ids.
This is not yet supported (in theory or in software).
[Congratulations, you have reached the boundary of existing methodology!]

PySpark how to incorporate user item features while building a recommender?

PySpark's mllib package provides the train() and trainImplicit() methods for training a recommendation model on explicit and implicit data, respectively.
I want to train a model on implicit data, specifically item purchase data. Since it is very rare in my case that a user will purchase an item more than once, the "rating" or "preference" is always 1. So my dataset looks like:
u1, i1, 1
u1, i2, 1
u2, i2, 1
u2, i3, 1
...
un, im, 1
where u represents a user and i an item.
I do have a lot of user features (demographics, location, etc.) as well as item features. But I cannot incorporate user or item features in the ALS.train or ALS.trainImplicit methods of pyspark.mllib.recommendation.
Alternatively, I have considered using fastFM or libfm. Both are packages for Factorization Machines that implement an ALS solver and frame recommendation as a regression/classification problem. With those packages I can include the user, the item and more features in the training data as X. However, the predicted variable y will only be a vector of ones (I don't have explicit ratings, only purchases).
How do I get around this issue?
MF in Spark is a simple collaborative filtering implementation based on user-item events (implicit) or ratings (explicit). You can introduce contextual information into a 2D (user-item) recommender by pre-filtering or post-filtering the data. For example, suppose you have demographic information (M/F) and a kNN recommender (it could be MF; it doesn't matter). For pre-filtering, you first select only the records that share the same context, and then you run kNN on them. For MF it works the same way: two models have to be generated, one for F and one for M, and when generating recommendations you first select the right model. Both techniques are well described in the "Recommender Systems Handbook".
For modeling context, FM is a good way to go. I think this post may be useful for you: How to use Python's FastFM library (factorization machines) for recommendation tasks?. There you will find how negative examples are introduced for implicit user feedback. Also pay attention to ranking prediction, which is usually the right way to go for recommendations.
Another option is to introduce your own heuristic, e.g. by boosting the final score. Maybe you have some knowledge, a business goal, or something else that can add value for you or the users.
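To make the "negative examples" point concrete, here is a minimal sketch of one common way to get a target that is not all ones: sample items the user has not purchased and label them 0. Uniform sampling, the function name and the toy data are assumptions for illustration; the linked post may do it differently.

```python
import random


def add_negative_samples(purchases, all_items, n_neg_per_pos=1, seed=0):
    """Return (user, item, label) rows: 1 for a purchase, 0 for a sampled non-purchase.

    Assumption: an item the user never bought is treated as a (noisy) negative.
    """
    rng = random.Random(seed)
    bought = {}
    for user, item in purchases:
        bought.setdefault(user, set()).add(item)

    items = sorted(all_items)
    rows = []
    for user, item in purchases:
        rows.append((user, item, 1))
        # Only items this user has not purchased are eligible as negatives.
        candidates = [i for i in items if i not in bought[user]]
        k = min(n_neg_per_pos, len(candidates))
        for negative in rng.sample(candidates, k):
            rows.append((user, negative, 0))
    return rows


purchases = [("u1", "i1"), ("u1", "i2"), ("u2", "i2"), ("u2", "i3")]
rows = add_negative_samples(purchases, {"i1", "i2", "i3", "i4"})
```

The same negative can be drawn more than once for a user; deduplicate if that matters. User and item features can then be joined onto each row to build X, with the 0/1 label as y.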

How Information Gain Works in Text Classification

I have to learn information gain for feature selection right now,
but I don't have a clear understanding of it. I am a newbie, and I'm confused about it.
How do I use IG in feature selection (manual calculation)?
This is the only clue I have. Can anyone help me with how to use the formula:
then this is the example:
How to use information gain in feature selection?
Information gain (InfoGain(t)) measures the number of bits of information obtained for prediction of a class (c) by knowing the presence or absence of a term (t) in a document.
Concisely, the information gain is a measure of the reduction in entropy of the class variable after the value for the feature is observed. In other words, information gain for classification is a measure of how common a feature is in a particular class compared to how common it is in all other classes.
In text classification, a feature is a term that appears in the documents (the corpus). Consider two terms in the corpus, term1 and term2. If term1 reduces the entropy of the class variable by a larger amount than term2, then term1 is more useful than term2 for document classification in this example.
Example in the context of sentiment classification
A word that occurs primarily in positive movie reviews and rarely in negative reviews carries a lot of information. For example, the presence of the word “magnificent” in a movie review is a strong indicator that the review is positive. That makes “magnificent” a highly informative word.
Compute entropy and information gain in python
Measuring Entropy and Information Gain
The formula comes from mutual information. In this case, you can think of mutual information as how much information the presence of the term t gives us for guessing the class.
Check: https://nlp.stanford.edu/IR-book/html/htmledition/mutual-information-1.html
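For a manual calculation, the standard definition is the class entropy minus the weighted entropy of the classes among documents that contain the term and documents that do not. Below is a minimal sketch of that calculation; the function names and the four toy "reviews" are made up for illustration, and documents are assumed to be sets of tokens.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(docs, labels, term):
    """Entropy of the class minus its entropy after observing presence/absence of term."""
    present = [y for doc, y in zip(docs, labels) if term in doc]
    absent = [y for doc, y in zip(docs, labels) if term not in doc]
    total = len(labels)
    # Weighted average entropy over the two partitions (empty partitions are skipped).
    conditional = sum(len(part) / total * entropy(part) for part in (present, absent) if part)
    return entropy(labels) - conditional


# Toy corpus: each document is a set of tokens (an assumption of this sketch).
docs = [
    {"magnificent", "film"},
    {"magnificent", "acting"},
    {"boring", "film"},
    {"dull", "film"},
]
labels = ["pos", "pos", "neg", "neg"]

ig_magnificent = information_gain(docs, labels, "magnificent")
ig_film = information_gain(docs, labels, "film")
```

Here "magnificent" separates the classes perfectly, so it recovers the full 1 bit of class entropy, while "film" appears in both classes and gains much less. For feature selection you would compute this for every term and keep the top-scoring ones.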

How to classify text when pre defined categories are not available

I have a problem and have no idea which algorithm to apply.
I am thinking of applying clustering in case two, but I have no idea about case one:
I have 0.5 million credit card activity documents. Each document is well defined and contains one transaction per line: the date, the amount, the retailer name, and a short 5-20 word description of the retailer.
Sample:
2004-11-47,$500,Amazon,An online retailer providing goods and services including books, hardware, music, etc.
Questions:
1. How would you classify each entry given no predefined categories?
2. How would you do this if you were given predefined categories such as "restaurant", "entertainment", etc.?
1) How would you classify each entry given no predefined categories?
You wouldn't. Instead, you'd use some dimensionality reduction algorithm on the data's features to plot them in 2-D, make a guess at the number of "natural" clusters, then run a clustering algorithm.
2) How would you do this if you were given predefined categories such as "restaurant", "entertainment", etc.?
You'd manually label a bunch of them, then train a classifier on that and see how well it works with the usual machinery of accuracy/F1, cross validation, etc. Or you'd check whether a clustering algorithm picks up these categories well, but then you still need some labeled data.

Supervised Learning for User Behavior over Time

I want to use machine learning to identify the signature of a user who converts to a subscriber of a website given their behavior over time.
Let's say my website has 6 different features which can be used before subscribing and users can convert to a subscriber at any time.
For a given user I have stats which represent the intensity on a continuous range of that user's interaction with features 1-6 on a daily basis so:
D1: f1,f2,f3,f4,f5,f6
D2: f1,f2,f3,f4,f5,f6
D3: f1,f2,f3,f4,f5,f6
D4: f1,f2,f3,f4,f5,f6
Let's say on day 5, the user converts.
What machine learning algorithms would help me identify the most common patterns in feature usage that lead to a conversion?
(I know this is a super basic classification question, but I couldn't find a good example using longitudinal data, where input vectors are ordered by time like I have)
To develop the problem further, let's assume that each feature has 3 intensities at which the user can interact (H, M, L).
We can then represent each user as a string of states of interaction intensity. So, for a user:
LLLLMM LLMMHH LLHHHH
Would mean on day one they only interacted significantly with features 5 and 6, but by the third day they were interacting highly with features 3 through 6.
N-gram Style
I could make these states words and the lifetime of a user a sentence. (Would probably need to add a "conversion" word to the vocabulary as well)
If I ran these "sentences" through an n-gram model, I could get the likely future state of a user given his/her past few states, which is somewhat interesting. But what I really want to know is the most common sets of n-grams that lead to the conversion word. Rather than feeding in an n-gram and getting the next predicted word, I want to give the predicted word and get back the 10 most common n-grams (from my data) which would be likely to lead to the word.
Amaç Herdağdelen suggests identifying n-grams up to a practical n and then counting how many n-gram states each user has, then correlating with conversion data (I guess no conversion word in this example). My concern is that there would be too many n-grams to make this method practical. (If each state has 729 possibilities, and we're using trigrams, that's a lot of possible trigrams!)
Alternatively, could I just go thru the data logging the n-grams which led to the conversion word and then run some type of clustering on them to see what the common paths are to a conversion?
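A minimal sketch of that "log the n-grams that led to conversion" idea, before any clustering, could look like the following. The function name, the "CONVERT" token and the toy histories are assumptions for illustration only.

```python
from collections import Counter


def ngrams_before_conversion(histories, n=2, top_k=10, conversion="CONVERT"):
    """Count the n daily states immediately preceding the conversion word."""
    counts = Counter()
    for states in histories:
        if conversion not in states:
            continue  # user has not converted (yet); nothing to log
        end = states.index(conversion)
        if end >= n:
            counts[tuple(states[end - n:end])] += 1
    return counts.most_common(top_k)


# Each history is one user's "sentence" of daily states, using the H/M/L encoding above.
histories = [
    ["LLLLMM", "LLMMHH", "CONVERT"],
    ["LLLLLL", "LLMMHH", "CONVERT"],
    ["LLLLMM", "LLMMHH", "LLHHHH", "CONVERT"],
    ["LLLLLL", "LLLLLL"],
]
top_unigrams = ngrams_before_conversion(histories, n=1)
```

Only n-grams that actually occur in the data are stored, so the 729^3 theoretical trigram space is never materialized. Raw counts ignore how often the same n-gram appears for users who never convert, so they would still need to be compared against that baseline.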
Survival Style
Suggested by Iterator. I understand the analogy to a survival problem, but the literature here seems to focus on predicting time to death as opposed to the common sequence of events which leads to death. Further, when looking up the Cox Proportional Hazard model, I found that it does not even accommodate variables which change over time (it's good for differentiating between static attributes like gender and ethnicity), so it seems very much geared toward a different question than mine.
Decision Tree Style
This seems promising though I can't completely wrap my mind around how to structure the data. Since the data is not flat, is the tree modeling the chance of moving from one state to another down the line and when it leads to conversion or not? This is very different than the decision tree data literature I've been able to find.
Also, I need clarity on how to identify patterns which lead to conversion, instead of a model that predicts the likelihood of conversion after a given sequence.
Theoretically, hidden Markov models may be a suitable solution to your problem. The features on your site would constitute the alphabet, and you can use the sequence of interactions as positive or negative instances depending on whether a user finally subscribed or not. I don't have a guess about what the number of hidden states should be, but finding a suitable value for that parameter is part of the problem, after all.
As a side note, positive instances are trivial to identify, but the fact that a user has not subscribed so far doesn't necessarily mean s/he won't. You might consider limiting your data to sufficiently old users.
I would also consider converting the data to fixed-length vectors and applying conceptually simpler models that could give you some intuition about what's going on. You could use n-grams (consecutive interaction sequences of length n).
As an example, assuming that the interaction sequence of a given user is "f1,f3,f5", "f1,f3,f5" would constitute a 3-gram (trigram). Similarly, for the same user and the same interaction sequence you would have "f1,f3" and "f3,f5" as the 2-grams (bigrams). In order to represent each user as a vector, you would identify all n-grams up to a practical n, and count how many times the user employed a given n-gram. Each column in the vector would represent the number of times a given n-gram is observed for a given user.
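A minimal sketch of that vectorization, using only the standard library, might look like this. The function names and the second user's sequence are made up for illustration.

```python
from collections import Counter


def ngram_counts(sequence, max_n=3):
    """Count every n-gram of length 1..max_n in one user's interaction sequence."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for start in range(len(sequence) - n + 1):
            counts[tuple(sequence[start:start + n])] += 1
    return counts


def to_vectors(user_sequences, max_n=3):
    """Build a shared n-gram vocabulary and one fixed-length count vector per user."""
    per_user = {user: ngram_counts(seq, max_n) for user, seq in user_sequences.items()}
    vocab = sorted({gram for counts in per_user.values() for gram in counts})
    vectors = {user: [counts[gram] for gram in vocab] for user, counts in per_user.items()}
    return vocab, vectors


example = ngram_counts(["f1", "f3", "f5"])

vocab, vectors = to_vectors({
    "user_a": ["f1", "f3", "f5"],
    "user_b": ["f1", "f3", "f1", "f3"],  # hypothetical second user
})
```

Each position in a user's vector lines up with the n-gram at the same position in vocab, so the vectors can be stacked into a matrix for correlation, feature selection or a classifier.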
Then, probably with the help of some suitable normalization technique such as pointwise mutual information or tf-idf, you could look at the correlation between the n-grams and the final outcome to get a sense of what's going on, carry out feature selection to find the most prominent sequences that users are involved in, or apply classification methods such as nearest neighbor, support vector machines or naive Bayes to build a predictive model.
This is rather like a survival analysis problem: over time the user will convert, or will drop out of the population, or will continue to appear in the data and not (yet) fall into either camp. For that, you may find the Cox proportional hazards model useful.
If you wish to pursue things from a different angle, namely one more from the graphical models perspective, then a Kalman filter may be more appealing. It is a generalization of HMMs (which were suggested by @AmaçHerdağdelen) that works for continuous spaces.
For ease of implementation, I'd recommend the survival approach. It is the easiest to analyze, describe, and improve. After you have a firm handle on the data, feel free to drop in other methods.
Other than Markov chains, I would suggest decision trees or Bayesian networks. Both of these would give you a likelihood of a user converting after a sequence.
I forgot to mention this earlier. You may also want to take a look at the Google PageRank algorithm. It would help you account for the user completely disappearing [not subscribing]. The results of that would help you to encourage certain features to be used. [Because they're more likely to give you a sale]
I think n-grams are the most promising approach, because all sequences in data mining are treated as elements dependent on a few basic steps (HMM, CRF, ACRF, Markov fields). So I would try a classifier based on 1-grams and 2-grams.
