How to update Spark MatrixFactorizationModel for ALS - apache-spark

I built a simple recommendation system for the MovieLens DB, inspired by https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html.
I also have problems with explicit training, like in this question: "Apache Spark ALS collaborative filtering results. They don't make sense".
Using implicit training (on both explicit and implicit data) gives me reasonable results, but explicit training doesn't.
While this is OK for me for now, I'm curious about how to update a model. My current solution works like this:
having all user ratings
generate model
get recommendations for user
I want to have a flow like this:
having a base of ratings
generate model once (optional save & load it)
get some ratings by one user on 10 random movies (not in the model!)
get recommendations using the model and the new user ratings
So I need to update my model without completely recomputing it. Is there any way to do so?
While the first way is good for batch processing (like generating recommendations in nightly batches), the second way would be good for generating recommendations in near real time.

Edit: the following worked for me because I had implicit feedback ratings and was only interested in ranking the products for a new user.
More details here
You can actually get predictions for new users using the trained model (without updating it):
To get predictions for a user in the model, you use its latent representation (a vector u of size f, the number of factors), which is multiplied by the product latent factor matrix (the matrix made of the latent representations of all products, a bunch of vectors of size f) and gives you a score for each product. For new users, the problem is that you don't have access to their latent representation; you only have the full representation of size M (the number of different products). What you can do is compute an approximate latent representation for this new user by multiplying the full representation by the transpose of the product matrix.
I.e. if your user latent matrix is u and your product latent matrix is v (of size f x M), then for user i in the model you get scores by doing: u_i * v
For a new user, you don't have a latent representation, so take the full representation full_u and do: full_u * v^t * v
This will approximate the latent factors for the new user and should give reasonable recommendations (if the model already gives reasonable recommendations for existing users).
To answer the question of training: this allows you to compute predictions for new users without redoing the heavy computation of the model, which you can now do only once in a while. So you have your batch processing at night and can still make predictions for new users during the day.
Note: MLlib gives you access to the matrices u and v (userFeatures and productFeatures on the MatrixFactorizationModel).
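The product full_u * v^t * v described above can be sketched in plain Python. This is only an illustration under stated assumptions: v is stored as a list of f rows of length M, full_u holds 0 for unrated products, and the function name fold_in_scores and the toy numbers are made up for the example rather than taken from MLlib.

```python
def fold_in_scores(full_u, v):
    """Score every product for a new user via full_u * v^T * v.

    full_u: list of M ratings (0.0 where the user has no rating).
    v:      product factor matrix as a list of f rows, each of length M.
    """
    # Step 1: full_u * v^T gives an approximate latent vector of size f.
    latent = [sum(r * x for r, x in zip(full_u, row)) for row in v]
    # Step 2: latent * v projects back to one score per product.
    m = len(full_u)
    return [sum(latent[k] * v[k][j] for k in range(len(v))) for j in range(m)]


# Hypothetical toy model: 2 factors, 3 products.
v = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 0.0]]
# The new user has only interacted with product 0.
full_u = [1.0, 0.0, 0.0]
scores = fold_in_scores(full_u, v)
```

In this toy case product 2 shares a factor with product 0, so it gets a higher score than product 1. In practice the already-rated products would be filtered out before ranking.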

It seems like you want to do some kind of online learning, that is, updating the model while you are receiving data. Spark MLlib has limited streaming machine learning options: there's a streaming linear regression and a streaming K-Means.
Many machine learning problems work just fine with batch solutions, perhaps retraining the model every few hours or days. There are probably strategies for solving this.
One option could be an ensemble model where you combine the results of your ALS with another model that helps make predictions about unseen movies.
If you expect to see a lot of previously unseen movies though, collaborative filtering probably doesn't do what you want. If those new movies aren't in the model at all, there's no way for the model to know what other people who watched those liked.
A better option might be to take a different strategy and try some kind of latent semantic analysis on the movies, modeling concepts of what a movie is (like genre, themes, etc.). That way new movies with various properties can fit into an existing model, and ratings affect how strongly those properties interact.

Related

Which additional features to use apart from Doc2Vec embeddings for Document Similarity?

So I am doing a project on document similarity, and right now my only features are the embeddings from Doc2Vec. That is not giving good results, even after hyperparameter optimization and training word embeddings before the doc embeddings. What other features can I add to get better results?
My dataset is 150 documents, 500-700 words each, with 10 topics(labels), each document having one topic. Documents are labeled on a document level, and that labeling is currently used only for evaluation purposes.
Edit: The following is answer to gojomo's questions and elaborating on my comment on his answer:
The evaluation of the model is done on the training set. I check whether a document's label is the same as the label of the most similar document according to the model. For this I first get the document vector using the model's method 'infer_vector' and then use 'most_similar' to get the most similar document. The results I am currently getting are 40-50% accuracy. A satisfactory score would be at least 65%.
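The evaluation described above can be sketched without gensim as a nearest-neighbour label check. This is a minimal illustration with assumptions that are not in the setup above: the vectors are hypothetical stand-ins for inferred document vectors, similarity is cosine, and each document is excluded from its own neighbour search.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_neighbor_accuracy(vectors, labels):
    """Fraction of documents whose most similar other document shares their label."""
    hits = 0
    for i, vec in enumerate(vectors):
        # Assumption: the document itself is left out of the search.
        others = (j for j in range(len(vectors)) if j != i)
        best = max(others, key=lambda j: cosine(vec, vectors[j]))
        hits += labels[best] == labels[i]
    return hits / len(vectors)


# Hypothetical 2-dimensional document vectors and their topic labels.
vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
labels = ["sports", "sports", "politics", "politics"]
accuracy = nearest_neighbor_accuracy(vectors, labels)
```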
Due to the purpose of this research and its further use case, I am unable to get a larger dataset. That is why a professor (this is a university project) recommended that I add some additional features to the Doc2Vec document embeddings. As I had no idea what he meant, I am asking the Stack Overflow community.
The end goal of the model is to cluster the documents; again, the labels are for now used only for evaluation purposes.
If I don't get good results with this model, I will try out the simpler ones mentioned by @Adnan S and @gojomo, such as TF-IDF, Word Mover's Distance and bag of words. I just presumed I would get better results using Doc2Vec.
You should try creating TF-IDF features with 2- and 3-grams to generate a vector representation for each document. You will have to build the vocabulary on all 150 documents. Once you have the TF-IDF vector for each document, you can use cosine similarity between any two of them.
Here is a blog article with more details and doc page for sklearn.
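As a rough illustration of the suggestion above, here is a standard-library-only sketch of TF-IDF over word n-grams with cosine similarity. The helper names, the smoothed IDF formula and the three toy documents are assumptions made for this example; in practice sklearn's TfidfVectorizer (with its ngram_range parameter) would normally do this work.

```python
import math
from collections import Counter


def ngrams(words, n_min=1, n_max=3):
    """All word n-grams of length n_min..n_max, joined with spaces."""
    return [" ".join(words[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(words) - n + 1)]


def tfidf_vectors(docs):
    """One sparse TF-IDF vector (dict term -> weight) per document."""
    counts = [Counter(ngrams(doc.lower().split())) for doc in docs]
    # Document frequency: in how many documents each term appears.
    df = Counter(term for c in counts for term in c)
    n = len(docs)
    # Smoothed IDF (an assumption of this sketch), so no weight is zero.
    return [{t: tf * (math.log((1 + n) / (1 + df[t])) + 1) for t, tf in c.items()}
            for c in counts]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical toy corpus; the vocabulary is built on all documents.
docs = ["spark trains the als model",
        "spark trains the als model nightly",
        "cats sleep all day"]
vecs = tfidf_vectors(docs)
sim_close = cosine(vecs[0], vecs[1])
sim_far = cosine(vecs[0], vecs[2])
```

Here the first two documents share most of their n-grams and score high, while the third shares none with the first and scores zero.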
How are you evaluating the results as not good, and how will you know when your results are adequate/good?
Note that just 150 docs of 400-700 words each is a tiny, tiny dataset: typical datasets used in published Doc2Vec results include tens of thousands to millions of documents, of hundreds to thousands of words each.
It will be hard for any of the Word2Vec/Doc2Vec/etc-style algorithms to do much with so little data. (The gensim Doc2Vec implementation includes a similar toy dataset, of 300 docs of 200-300 words each, as part of its unit-testing framework, and to eke out even vaguely useful results, it must up the number of training epochs, and shrink the vector size, significantly.)
So if intending to use Doc2Vec-like algorithms, your top priority should be finding more training data. Even if, in the end, only ~150 docs are significant, collecting more documents that use similar domain language can help improve the model.
It's unclear what you mean when you say there are 10 topics, and 1 topic per document. Are those human-assigned categories, and are those included as part of the training texts or tags passed to the Doc2Vec algorithm? (It might be reasonable to include it, depending on what your end-goals and document-similarity evaluations consist of.)
Are these topics the same as the labelling you also mention, and are you ultimately trying to predict the topics, or just using the topics as a check of the similarity-results?
As @adnan-s suggests in the other answer, it may also be worth trying simpler count-based 'bag of words' document representations, potentially including word n-grams or even character n-grams, or with TF-IDF weighting.
If you have adequate word-vectors, as trained from your data or from other compatible sources, the "Word Mover's Distance" measure can be another interesting way to compute pairwise similarities. (However, it may be too expensive to calculate between many-hundred-word texts - working much faster on shorter texts.)
As others have already suggested your training set of 150 documents probably isn't big enough to create good representations. You could, however, try to use a pre-trained model and infer the vectors of your documents.
Here is a link where you can download a (1.4GB) DBOW model trained on English Wikipedia pages, working with 300-dimensional document vectors. I obtained the link from jhlau/doc2vec GitHub repository. After downloading the model you can use it as follows:
from gensim.models import Doc2Vec
# load the downloaded model
model_path = "enwiki_dbow/doc2vec.bin"
model = Doc2Vec.load(model_path)
# infer vector for your document
doc_vector = model.infer_vector(doc_words)
Where doc_words is a list of words in your document.
This, however, may not work for you in case your documents are very specific. But you can still give it a try.

word2vec: weighted How can I give negative training data?

I'm reusing word2vec for products on my website and users. I would like to say that a user is NEGATIVELY associated with a product if he has visited the page for < 5 seconds, and POSITIVELY if he spent > 30 seconds on the page. Is there a way to specify this in word2vec? Or is there some other tool that enables this?
Your question is not well defined, but I think you want to store the relation of the user with the product, which has nothing to do with word2vec. word2vec essentially gives you a mapping from strings to vectors in a continuous domain. In your problem you should supply a separate new feature for the user-product relationship (NEGATIVE or POSITIVE) along with the word2vec features, and let the model retrain the word embeddings according to this new feature while solving your particular task. This way the model will adjust the word embeddings and get some of the desired effect of the POSITIVE/NEGATIVE features.
Please be more elaborate so that I can answer your question in a better way.

Spark MLLib Collaborative Filtering with new user

I'm trying out the Collaborative Filtering algorithm implemented in Spark and am running into the following issue:
Suppose I train a model with the following data:
u1|p1|3
u1|p2|3
u2|p1|2
u2|p2|3
Now if I test it with the following data:
u1|p1|1
u3|p1|2
u3|p2|3
I never see any ratings for the user 'u3', presumably because that user does not appear in the training data. Is this because of the cold start issue? I was under the impression that this issue would apply only to a new product. In this case, I would have expected a prediction for 'u3' since 'u1' and 'u2' in the training data have similar rating information to 'u3'. Is this the distinction between model-based and memory-based collaborative filtering?
I assume you are talking about the ALS algorithm?
'u3' is not part of your training set, and therefore your model does not know anything about that user. All one could do is maybe return the mean rating over all users.
Looking into the Spark 1.3.0 Scala code: the MatrixFactorizationModel returned by ALS.train() tries to look up the user and the product in the feature vectors when you call predict(). I get a NoSuchElementException when I try to predict a rating for an unknown user. It is just implemented that way.
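The fallback suggested above (return the mean rating when the user is unknown) could look roughly like this in plain Python. It is a sketch, not Spark's behaviour: the factor values are invented, and predict_with_fallback is a hypothetical helper, not an MLlib function. The ratings are the four training rows from the question.

```python
def predict_with_fallback(user, product, user_factors, product_factors, fallback):
    """Dot product of the latent vectors, or `fallback` if either ID is unknown."""
    u = user_factors.get(user)
    p = product_factors.get(product)
    if u is None or p is None:
        # Unknown user or product: no latent vector exists, so fall back.
        return fallback
    return sum(a * b for a, b in zip(u, p))


# Training ratings from the question: (user, product, rating).
ratings = [("u1", "p1", 3), ("u1", "p2", 3), ("u2", "p1", 2), ("u2", "p2", 3)]
global_mean = sum(r for _, _, r in ratings) / len(ratings)

# Hypothetical latent factors for the users and products seen in training.
user_factors = {"u1": [1.5, 0.0], "u2": [1.0, 0.5]}
product_factors = {"p1": [2.0, 1.0], "p2": [2.0, 2.0]}

known = predict_with_fallback("u1", "p1", user_factors, product_factors, global_mean)
unknown = predict_with_fallback("u3", "p1", user_factors, product_factors, global_mean)
```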

Spark - How to use the trained recommender model in production?

I am using Spark to build a recommendation system prototype. After going through some tutorials, I have been able to train a MatrixFactorizationModel from my data.
However, the model trained by Spark mllib is just a Serializable. How can I use this model to do recommendation for real users? I mean, how can I persist the model into some sort of database or update it if the user data has been incremented?
For example, the model trained by Mahout recommendation library can be stored into databases like Redis, then we can query for the recommended item list later. But how can we do similar stuff in Spark? Any suggestion?
First, the "model" you're referring to from Mahout is not a model, but a pre-computed list of recommendations. You could also do this with Spark, and compute in batch recommendations for users, and persist them anywhere you like. This has nothing to do with serializing a model. If you don't want to do real-time updates or scoring, you can stop there and just use Spark for batch just like you do Mahout.
But I agree that in a lot of cases you do want to ship the model somewhere else and serve it. As you can see, other models in Spark are Serializable, but not MatrixFactorizationModel. (Yes, even though it's marked as such, it won't serialize.) Likewise, there is a standard serialization for predictive models called PMML but it contains no vocabulary for a factored matrix model.
The reason is actually the same. Whereas many predictive models, like an SVM or logistic regression model, are just a small set of coefficients, a factored matrix model is huge, containing two matrices with potentially billions of elements. That is why I think PMML doesn't have any reasonable encoding for it.
Likewise, in Spark, that means the actual matrices are RDDs that can't be serialized directly. You can persist these RDDs to storage, re-read them elsewhere using Spark, and recreate a MatrixFactorizationModel by hand that way.
You can't serve or update the model using Spark though. For this you are really looking at writing some code to perform updates and calculate recommendations on the fly.
I don't mind suggesting here the Oryx project, since its point is to manage exactly this aspect, particularly for ALS recommendation. In fact, the Oryx 2 project is based on Spark and although in alpha, already contains the complete pipeline to serialize and serve the output of MatrixFactorizationModel. I don't know if it meets your needs, but may at least be an interesting reference point.
Another method for creating recs with Spark is the search engine method. This is basically a cooccurrence recommender served by Solr or Elasticsearch. Comparing factorized to cooccurrence is beyond this question so I'll just describe the latter.
You feed interactions (user-id,item-id) into Mahout's spark-itemsimilarity. This produces a list of similar items for every item seen in the interaction data. It will come out by default as a csv and so can be stored anywhere. But it needs to be indexed by a search engine.
In any case, when you want to fetch recs, you use the user's history as the query and get back an ordered list of items as recs.
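The flow just described (interactions in, similar items per item out, user history as the query) can be sketched in plain Python. Note the assumptions: this uses raw cooccurrence counts rather than whatever similarity measure spark-itemsimilarity applies, a dict stands in for the search engine index, and all names and data are made up for the example.

```python
from collections import Counter, defaultdict
from itertools import permutations


def item_similarity(interactions):
    """For every item, count how often each other item occurs with the same user."""
    by_user = defaultdict(set)
    for user, item in interactions:
        by_user[user].add(item)
    cooccurrence = defaultdict(Counter)
    for items in by_user.values():
        for a, b in permutations(items, 2):
            cooccurrence[a][b] += 1
    return cooccurrence


def recommend(history, cooccurrence, k=3):
    """Use the user's history as the query; return the top-k unseen items."""
    scores = Counter()
    for item in history:
        for other, count in cooccurrence.get(item, {}).items():
            if other not in history:
                scores[other] += count
    return [item for item, _ in scores.most_common(k)]


# Hypothetical (user-id, item-id) interactions.
interactions = [("u1", "a"), ("u1", "b"),
                ("u2", "a"), ("u2", "b"), ("u2", "c"),
                ("u3", "b"), ("u3", "c")]
cooccurrence = item_similarity(interactions)
# A user who has only interacted with item "a", new or not.
recs = recommend({"a"}, cooccurrence)
```

Because the query is just a history, this works for users who were not present when the similarity table was built.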
One benefit of this method is that indicators can be calculated for as many user actions as you want. Any action the user takes that correlates with what you want to recommend can be used. For instance, say you want to recommend purchases but you record product views as well. If you treated product views the same as purchases, you would likely get worse recs (I've tried it). However, if you calculate an indicator for purchases and another (actually cross-cooccurrence) indicator for product views, they are equally predictive of purchases. This has the effect of increasing the data used for recs. The same type of thing can be done with user locations, to blend location information into purchase recs.
You can also bias your recs based on context. If you are in the "electronics" section of a catalog, you may want recs to be skewed towards electronics. Add electronics to the query against the item's "category" metadata field and give it a boost in the query and you have biased recs.
Since all of the biasing and mixing of indicators happens in the query it makes the recs engine easily tuned to multiple contexts while maintaining only one multi-field query made through a search engine. We get scalability from Solr or Elasticsearch.
One other benefit of either factorization or the search method is that entirely new users and new history can be used to create recs where the older Mahout recommenders could only recommend to users and interactions known when the job was run.
Descriptions here:
Mahout docs
Slides
Mahout on Spark: What’s New in Recommenders, part 1
Mahout on Spark: What’s New in Recommenders, part 2
Practical Machine Learning ebook
You should run model.predictAll() on a reduced RDD set of (user,product) pairs like in the Mahout Hadoop Job and store the results for online usage...
https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.java
You can use the function .save(sparkContext, outputFolder) to save the model to a folder of your choice. While giving the recommendations in realtime, you just have to use MatrixFactorizationModel.load(sparkContext, modelFolder) function to load it as a MatrixFactorizationModel object.
A question to @Sean Owen: doesn't the MatrixFactorizationModel object contain the factorization matrices (the user-feature and item-feature matrices) instead of recommendations/predicted ratings?

Incremental training of ALS model

I'm trying to find out if it is possible to have "incremental training" on data using MLlib in Apache Spark.
My platform is Prediction IO, and it's basically a wrapper for Spark (MLlib), HBase, ElasticSearch and some other Restful parts.
In my app data "events" are inserted in real-time, but to get updated prediction results I need to "pio train" and "pio deploy". This takes some time and the server goes offline during the redeploy.
I'm trying to figure out if I can do incremental training during the "predict" phase, but cannot find an answer.
I imagine you are using Spark MLlib's ALS model, which performs matrix factorization. The result of the model is two matrices: a user-features matrix and an item-features matrix.
Assuming we receive a stream of ratings (or transactions, in the implicit case), a real (100%) online update of this model would mean updating both matrices for each new rating by triggering a full retrain of the ALS model on the entire data plus the new rating. In this scenario one is limited by the fact that running the entire ALS model is computationally expensive, and the incoming stream of data could be frequent, so it would trigger a full retrain too often.
So, knowing this, we can look for alternatives. A single rating should not change the matrices much, and there are optimization approaches that are incremental, for example SGD. There is an interesting (still experimental) library written for the case of explicit ratings which does incremental updates for each batch of a DStream:
https://github.com/brkyvz/streaming-matrix-factorization
The idea behind an incremental approach such as SGD is that, as long as one takes a small enough step against the gradient of the error function (this is a minimization problem), one moves towards a minimum. So even if we update for a single new rating, changing only the row of the user-features matrix for this specific user and only the row of the item-features matrix for the specific item rated, and the update follows the negative gradient, we still move towards the minimum; as an approximation, of course, but still towards the minimum.
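A single-rating SGD step of the kind described above could be sketched as follows. This is a generic regularized matrix-factorization update, not code from the library linked above; the learning rate, the regularization value and the starting vectors are arbitrary assumptions for the example.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sgd_update(u, v, rating, lr=0.05, reg=0.01):
    """One SGD step on the squared error of a single (user, item, rating).

    u: latent vector of the user who rated.
    v: latent vector of the item that was rated.
    Only these two vectors change; the rest of both matrices is untouched.
    """
    err = rating - dot(u, v)
    # Step against the gradient of 0.5 * err^2 + 0.5 * reg * (|u|^2 + |v|^2).
    # Both updates use the old u and v.
    new_u = [a + lr * (err * b - reg * a) for a, b in zip(u, v)]
    new_v = [b + lr * (err * a - reg * b) for a, b in zip(u, v)]
    return new_u, new_v


def squared_error(u, v, rating):
    return (rating - dot(u, v)) ** 2


# Hypothetical current factors for one user and one item, and a new rating.
u, v, rating = [0.1, 0.2], [0.3, 0.1], 4.0
error_before = squared_error(u, v, rating)
u1, v1 = sgd_update(u, v, rating)
error_after_one = squared_error(u1, v1, rating)

# Repeating the step keeps shrinking the error for this rating.
for _ in range(50):
    u, v = sgd_update(u, v, rating)
error_after_many = squared_error(u, v, rating)
```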
The other problem comes from Spark itself and from the distributed system. Ideally the updates should be done sequentially, one for each new incoming rating, but Spark treats the incoming stream as a batch, which is distributed as an RDD, so the update operations are applied to the entire batch with no guarantee of sequentiality.
In more detail: if you are using Prediction.IO, for example, you could do offline training with the regular built-in train and deploy functions. If you want online updates, you will have to access both matrices for each batch of the stream, run the updates using SGD, and then ask for the new model to be deployed. This functionality is of course not in Prediction.IO; you would have to build it on your own.
Interesting notes for SGD updates:
http://stanford.edu/~rezab/classes/cme323/S15/notes/lec14.pdf
To update your model near-online (I write "near" because, let's face it, true online updating is impossible), you can use the fold-in technique, e.g.:
Online-Updating Regularized Kernel Matrix Factorization Models for Large-Scale Recommender Systems.
Or you can look at the code of:
MyMediaLite
Oryx - a framework built with the Lambda Architecture paradigm. It should have updates with fold-in of new users/items.
This is part of my answer to a similar question, where both problems (near-online training and handling new users/items) were mixed.

Resources