Spark- The purpose of saving ALS model - apache-spark

I'm trying to understand the purpose of storing an ALS model and what the use case for a stored model would be.
I have a dataset of over 300M rows, and I'm using a Hadoop cluster and Spark to calculate recommendations with the ALS algorithm.
The whole computation takes around 5h, and I'm wondering what the case would be for storing my model and using it, for example, the next day. I don't see any. So either I'm doing something wrong (which is possible, given that I'm a beginner in the ML world), or the ability to save a Spark ALS model to disk is not very helpful.
Right now, I use it as follows:
from pyspark.ml.recommendation import ALS

df_input = spark.read.format("avro").load(PATH, schema=SCHEMA)
als = ALS(maxIter=12, regParam=0.05, rank=15, userCol="user", itemCol="item", ratingCol="rating", coldStartStrategy="drop")
model = als.fit(df_input)
df_recommendations = model.recommendForAllUsers(10)
As I mentioned, df_input is a DataFrame with over 300M rows. The total calculation time is around 5h, after which I get 10 recommended items for each user in the dataset.
Many tutorials and books show an example of training the model and validating it with test data, something like:
from pyspark.ml.recommendation import ALS, ALSModel

df_input = spark.read.format("avro").load(PATH, schema=SCHEMA)
(training, test) = df_input.randomSplit(weights = [0.7, 0.3])
als = ALS(maxIter=12, regParam=0.05, rank=15, userCol="user", itemCol="item", ratingCol="rating", coldStartStrategy="drop")
model = als.fit(training)
model.write().save("saved_model")
...
model = ALSModel.load('saved_model')
predictions = model.transform(test)  # or df_input to get predictions for each user
I don't see any advantage in using it this way. However, I see one big drawback: 30% of the data is not used to train the model.
As far as I know, there is no way to use an ALS model online (in real time), at least not without an external package or library.
You can't incrementally update this model.
You can't use it for newly registered users because they don't exist in the stored matrix factorization, so there won't be any recommendations for them.
All you can do is check what the prediction would be for a given user-item pair, which is basically the same thing that the first code example (using the fit() method) returns.
What would be a reason to store this model on disk and load it when needed? Or when (what conditions should be met) should I consider storing the model and reusing it? Could you provide a use case?

Related

difference between tf.data.Dataset batch and map and tf.contrib.data.map_and_batch

I created a tf.data.Dataset and want to train a model using this dataset:
dataset = dataset.prefetch()
dataset = dataset.shuffle()
dataset = dataset.repeat()
dataset = dataset.map()
dataset = dataset.filter()
dataset = dataset.batch()
I want to know what the difference is between the dataset above and the one below:
dataset = dataset.prefetch()
dataset = dataset.shuffle()
dataset = dataset.repeat()
dataset = dataset.apply(tf.contrib.data.map_and_batch())
I know that they should not differ except in performance, but I don't know whether I should use the .apply() method or not.
Is the first implementation correct?
First off, most of the tf.contrib.data functions are deprecated and moved to tf.data.experimental. So watch out for that.
Take a look at the input pipeline performance guide to get a good idea of a suitable ordering of the transformations for your application. Regarding map_and_batch: yes, the transformation it returns is passed to apply(), as stated in the description of its return value.
We use map_and_batch for efficiency reasons; how much it helps normally depends on what your data is and how costly your map function is. The performance guide has some guidelines on this.
Regarding the difference between your first and second blocks of code, there is a filter function in between, so both blocks might not give the same result depending on what you are filtering.
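The point about the filter step can be illustrated without TensorFlow. What follows is only a pure-Python analogy, not tf.data code: the helpers batch, pipeline_with_filter and pipeline_fused are hypothetical names made up for this sketch. It shows that dropping elements between map and batch changes what ends up in each batch, while a fused map-and-batch step has no place to filter.

```python
from itertools import islice

def batch(iterable, size):
    """Group an iterable into lists of at most `size` elements."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def pipeline_with_filter(data, map_fn, keep_fn, size):
    # map -> filter -> batch, mirroring the order of the first snippet
    mapped = (map_fn(x) for x in data)
    kept = (x for x in mapped if keep_fn(x))
    return list(batch(kept, size))

def pipeline_fused(data, map_fn, size):
    # map and batch applied together, with no filter in between
    return list(batch((map_fn(x) for x in data), size))

def double(x):
    return x * 2

def not_multiple_of_four(x):
    return x % 4 != 0

with_filter = pipeline_with_filter(range(6), double, not_multiple_of_four, 2)
fused = pipeline_fused(range(6), double, 2)
```

With this toy input the two pipelines produce different batches, so they are only interchangeable if the filter is dropped or moved before the map.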

The best practice to use Spark's generated mllib model as a server

I am trying to find out what the proper way is to use a model generated by Spark+MLlib (in this case a Collaborative Filtering Recommendation Engine) to provide predictions quickly, on demand, and as a server.
My current solution is to run an instance of Spark continuously for this purpose, but I wanted to know whether there are better solutions to this, perhaps a solution that does not require a running Spark. Perhaps there is a way to load and use a generated model by Spark without involving Spark?
You can export a model via PMML and then take that model and use it in another application.
I have now found a way. First, we can save the ALS model's product features and user features, obtained via model.productFeatures() and model.userFeatures().
Then we get product features like this:
209699159874445020
0.0533636957407,-0.0878632888198,0.105949401855,0.129774808884,0.0953511446714,0.16420891881,0.0558457262814,0.0587058141828
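So we can load the product features and user features into two dicts in Python and build a server with Tornado that predicts ratings using these two dicts. Here is the code as an example:
def predict(item_id, user_id):
    # Find the row holding the item's (goods') latent factors; column 0 is skipped (presumably the id)
    ind = item_id_index[item_id]
    gf = goods_features[ind, 1:]
    # Find the row holding the user's latent factors the same way
    ind = user_id_index[user_id]
    uf = user_features[ind, 1:]
    # The predicted rating is the dot product of the two factor vectors
    return blas.ddot(gf, uf, len(gf), 0, 1, 0, 1)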
In conclusion, we need to persist the ALS model ourselves, and it isn't as difficult as we thought. Any suggestions are welcome.
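For readers who want to try the idea without NumPy or BLAS, here is a minimal self-contained sketch. It assumes the exported file alternates an id line and a line of comma-separated factors, as in the sample above; parse_features, this variant of predict, and the toy values are hypothetical and not part of Spark.

```python
def parse_features(text):
    """Parse alternating 'id' / 'comma-separated floats' lines into a dict."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    features = {}
    for id_line, vec_line in zip(lines[0::2], lines[1::2]):
        features[id_line] = [float(x) for x in vec_line.split(",")]
    return features

def predict(user_features, item_features, user_id, item_id):
    """Dot product of the two latent vectors, or None if either id is unknown."""
    uf = user_features.get(user_id)
    gf = item_features.get(item_id)
    if uf is None or gf is None:
        return None
    return sum(u * g for u, g in zip(uf, gf))

# Toy factors, made up for illustration only.
item_features = parse_features("item-1\n0.5,1.0\nitem-2\n-1.0,2.0\n")
user_features = parse_features("user-1\n2.0,1.0\n")

score = predict(user_features, item_features, "user-1", "item-1")
unknown = predict(user_features, item_features, "user-9", "item-1")
```

Returning None for unknown ids keeps the server from crashing on users or items that were not in the training data; what to serve in that case is a separate decision.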

How to update Spark MatrixFactorizationModel for ALS

I build a simple recommendation system for the MovieLens DB inspired by https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html.
I also have problems with explicit training like here: Apache Spark ALS collaborative filtering results. They don't make sense
Using implicit training (on both explicit and implicit data) gives me reasonable results, but explicit training doesn't.
While this is OK for me for now, I'm curious about how to update a model. My current solution works like this:
having all user ratings
generate model
get recommendations for user
I want to have a flow like this:
having a base of ratings
generate model once (optional save & load it)
get some ratings by one user on 10 random movies (not in the model!)
get recommendations using the model and the new user ratings
Therefore I must update my model without completely recomputing it. Is there any way to do so?
While the first way is good for batch processing (like generating recommendations in nightly batches), the second way would be good for near-live generation of recommendations.
Edit: the following worked for me because I had implicit feedback ratings and was only interested in ranking the products for a new user.
More details here
You can actually get predictions for new users using the trained model (without updating it):
To get predictions for a user in the model, you use the user's latent representation (a vector u of size f, the number of factors). It is multiplied by the product latent factor matrix (the latent representations of all products, i.e. a set of vectors of size f), which gives you a score for each product. For new users, the problem is that you don't have their latent representation; you only have the full representation of size M (the number of different products). What you can do is compute an approximate latent representation for the new user by multiplying the full representation by the transpose of the product matrix.
I.e. if your user latent matrix is u and your product latent matrix is v, then for user i in the model you get scores by doing: u_i * v
For a new user, you don't have a latent representation, so take the full representation full_u and do: full_u * v^t * v
This approximates the latent factors for the new user and should give reasonable recommendations (if the model already gives reasonable recommendations for existing users).
To answer the question about training: this allows you to compute predictions for new users without redoing the heavy computation of the model, which you now only need to do once in a while. So you have your batch processing at night and can still make predictions for new users during the day.
Note: MLLIB gives you access to the matrix u and v
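The fold-in computation described above can be sketched in plain Python. This is only an illustration under stated assumptions: v is stored as f rows of length M, full_u holds the new user's interactions (0 where there is none), and the name fold_in_scores and the toy numbers are made up for the example and are not part of MLlib.

```python
def fold_in_scores(full_u, v):
    """Approximate per-product scores for a user who is not in the model.

    full_u: list of length M, one entry per product.
    v: product latent factors as f rows, each of length M.
    """
    # full_u * v^t: project the interaction vector into the latent space (length f)
    latent = [sum(r * x for r, x in zip(full_u, row)) for row in v]
    # latent * v: map the approximate latent vector back to one score per product
    return [
        sum(latent[k] * v[k][j] for k in range(len(v)))
        for j in range(len(full_u))
    ]

# Two latent factors, three products. Products 0 and 1 share the first factor.
v = [
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
full_u = [1.0, 0.0, 0.0]  # the new user interacted with product 0 only

scores = fold_in_scores(full_u, v)
```

In this toy case product 1 receives the same score as product 0 because both load on the same factor, while product 2 scores zero.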
It seems like you want to be doing some kind of online learning. That's the notion that you're actually updating the model while receiving data. Spark MLLib has limited streaming machine learning options. There's a streaming linear regression and a streaming K-Means.
Many machine learning problems work just fine with batch solutions, perhaps retraining the model every few hours or days. There are probably strategies for solving this.
One option could be an ensemble model where you combine the results of your ALS with another model that helps make predictions about unseen movies.
If you expect to see a lot of previously unseen movies though, collaborative filtering probably doesn't do what you want. If those new movies aren't in the model at all, there's no way for the model to know what other people who watched those liked.
A better option might be to take a different strategy and try some kind of latent semantic analysis on the movies, modeling concepts of what a movie is (genre, themes, etc.). That way, new movies with various properties can fit into an existing model, and ratings affect how strongly those properties interact.

Spark MLLib Collaborative Filtering with new user

I'm trying out the Collaborative Filtering algorithm implemented in Spark and am running into the following issue:
Suppose I train a model with the following data:
u1|p1|3
u1|p2|3
u2|p1|2
u2|p2|3
Now if I test it with the following data:
u1|p1|1
u3|p1|2
u3|p2|3
I never see any ratings for the user 'u3', presumably because that user does not appear in the training data. Is this because of the cold start issue? I was under the impression that this issue would apply only to a new product. In this case, I would have expected a prediction for 'u3' since 'u1' and 'u2' in the training data have similar rating information to 'u3'. Is this the distinction between model-based and memory-based collaborative filtering?
I assume you are talking about the ALS algorithm?
'u3' is not part of your training set, and therefore your model does not know anything about that user. All one could do is maybe return the mean rating over all users.
Looking into the Spark 1.3.0 Scala code: the MatrixFactorizationModel returned by ALS.train() tries to look up the user and product in the feature vectors when you call predict(). I get a NoSuchElementException when I try to predict a rating for an unknown user. It is just implemented that way.
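The mean-rating fallback mentioned above could look roughly like this. It is a sketch only: the dictionary lookup stands in for a trained model's predict() (raising KeyError where Spark raises NoSuchElementException), and parse_ratings and make_predictor are hypothetical helper names.

```python
def parse_ratings(text):
    """Parse 'user|product|rating' lines into (user, product, rating) tuples."""
    rows = []
    for line in text.strip().splitlines():
        user, product, rating = line.split("|")
        rows.append((user, product, float(rating)))
    return rows

def make_predictor(model_predict, training_rows):
    """Wrap a predict function so unknown users or products get the global mean."""
    global_mean = sum(r for _, _, r in training_rows) / len(training_rows)

    def predict(user, product):
        try:
            return model_predict(user, product)
        except KeyError:
            # Cold start: the model has no factors for this user or product
            return global_mean

    return predict

# Training data from the question.
training = parse_ratings("u1|p1|3\nu1|p2|3\nu2|p1|2\nu2|p2|3")

# Stand-in for a trained model: a lookup that fails for unseen pairs.
known = {(u, p): r for u, p, r in training}
predict = make_predictor(lambda u, p: known[(u, p)], training)

u3_prediction = predict("u3", "p1")
```

The global mean carries no personalization, so it is a placeholder answer for cold-start users rather than a real recommendation.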

Spark - How to use the trained recommender model in production?

I am using Spark to build a recommendation system prototype. After going through some tutorials, I have been able to train a MatrixFactorizationModel from my data.
However, the model trained by Spark mllib is just a Serializable. How can I use this model to do recommendation for real users? I mean, how can I persist the model into some sort of database or update it if the user data has been incremented?
For example, the model trained by Mahout recommendation library can be stored into databases like Redis, then we can query for the recommended item list later. But how can we do similar stuff in Spark? Any suggestion?
First, the "model" you're referring to from Mahout is not a model, but a pre-computed list of recommendations. You could also do this with Spark, and compute in batch recommendations for users, and persist them anywhere you like. This has nothing to do with serializing a model. If you don't want to do real-time updates or scoring, you can stop there and just use Spark for batch just like you do Mahout.
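The "compute in batch and persist the results" approach can be sketched as follows. This is plain Python rather than Spark code: the factor dictionaries are toy values, top_n_for_all_users is a hypothetical helper, and the resulting dict stands in for whatever key-value store (Redis, for instance) you would write to.

```python
import heapq

def top_n_for_all_users(user_factors, item_factors, n):
    """Precompute the n best-scoring items for every user."""
    recs = {}
    for user_id, uf in user_factors.items():
        # Score every item by the dot product of user and item factors
        scored = (
            (sum(a * b for a, b in zip(uf, vf)), item_id)
            for item_id, vf in item_factors.items()
        )
        recs[user_id] = [item_id for _, item_id in heapq.nlargest(n, scored)]
    return recs

# Toy factors, made up for illustration only.
user_factors = {"u1": [1.0, 0.0], "u2": [0.0, 1.0]}
item_factors = {"a": [0.9, 0.1], "b": [0.2, 0.8], "c": [0.5, 0.5]}

# This dict plays the role of the external store queried at serving time.
store = top_n_for_all_users(user_factors, item_factors, 2)
```

Serving is then a single key lookup per user, at the cost of recommendations going stale until the next batch run.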
But I agree that in a lot of cases you do want to ship the model somewhere else and serve it. As you can see, other models in Spark are Serializable, but not MatrixFactorizationModel. (Yes, even though it's marked as such, it won't serialize.) Likewise, there is a standard serialization for predictive models called PMML but it contains no vocabulary for a factored matrix model.
The reason is actually the same. Whereas many predictive models, like an SVM or logistic regression model, are just a small set of coefficients, a factored matrix model is huge, containing two matrices with potentially billions of elements. That is why I think PMML doesn't have any reasonable encoding for it.
Likewise, in Spark, that means the actual matrices are RDDs that can't be serialized directly. You can persist these RDDs to storage, re-read them elsewhere using Spark, and recreate a MatrixFactorizationModel by hand that way.
You can't serve or update the model using Spark though. For this you are really looking at writing some code to perform updates and calculate recommendations on the fly.
I don't mind suggesting here the Oryx project, since its point is to manage exactly this aspect, particularly for ALS recommendation. In fact, the Oryx 2 project is based on Spark and although in alpha, already contains the complete pipeline to serialize and serve the output of MatrixFactorizationModel. I don't know if it meets your needs, but may at least be an interesting reference point.
Another method for creating recs with Spark is the search engine method. This is basically a cooccurrence recommender served by Solr or Elasticsearch. Comparing factorized to cooccurrence is beyond this question so I'll just describe the latter.
You feed interactions (user-id,item-id) into Mahout's spark-itemsimilarity. This produces a list of similar items for every item seen in the interaction data. It will come out by default as a csv and so can be stored anywhere. But it needs to be indexed by a search engine.
In any case, when you want to fetch recs, you use the user's history as the query and get back an ordered list of items as recs.
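A toy version of the idea, using the user's history as the query, might look like this. Note the assumptions: it uses raw co-occurrence counts in memory, not the similarity scoring of Mahout's spark-itemsimilarity and not a search engine, and cooccurrence and recommend are hypothetical helper names.

```python
from collections import defaultdict
from itertools import permutations

def cooccurrence(interactions):
    """Count, for each item, how often every other item appears with it in a user's history."""
    items_by_user = defaultdict(set)
    for user, item in interactions:
        items_by_user[user].add(item)
    counts = defaultdict(lambda: defaultdict(int))
    for items in items_by_user.values():
        for a, b in permutations(items, 2):
            counts[a][b] += 1
    return counts

def recommend(history, counts, n):
    """Use the history as the query: sum the counts of items similar to what was seen."""
    scores = defaultdict(int)
    for seen in history:
        for other, c in counts.get(seen, {}).items():
            if other not in history:
                scores[other] += c
    # Highest score first, ties broken by item id for a stable order
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:n]]

interactions = [
    ("u1", "a"), ("u1", "b"),
    ("u2", "a"), ("u2", "b"), ("u2", "c"),
    ("u3", "b"), ("u3", "c"),
]
counts = cooccurrence(interactions)

# A brand-new user who has only interacted with item "a"
recs = recommend(["a"], counts, 2)
```

Because the query is just a list of item ids, a user who did not exist when the counts were computed can still get recs.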
One benefit of this method is that indicators can be calculated for as many user actions as you want. Any action the user takes that correlates to what you want to recommend can be used. For instance, suppose you want to recommend purchases but you record product views as well. If you treated product views the same as purchases, you would likely get worse recs (I've tried it). However, if you calculate an indicator for purchases and another (actually cross-cooccurrence) indicator for product views, they are equally predictive of purchases. This has the effect of increasing the data used for recs. The same type of thing can be done with user locations to blend location information into purchase recs.
You can also bias your recs based on context. If you are in the "electronics" section of a catalog, you may want recs to be skewed towards electronics. Add electronics to the query against the item's "category" metadata field and give it a boost in the query and you have biased recs.
Since all of the biasing and mixing of indicators happens in the query it makes the recs engine easily tuned to multiple contexts while maintaining only one multi-field query made through a search engine. We get scalability from Solr or Elasticsearch.
One other benefit of either factorization or the search method is that entirely new users and new history can be used to create recs where the older Mahout recommenders could only recommend to users and interactions known when the job was run.
Descriptions here:
Mahout docs
Slides
Mahout on Spark: What’s New in Recommenders, part 1
Mahout on Spark: What’s New in Recommenders, part 2
Practical Machine Learning ebook
You should run model.predictAll() on a reduced RDD set of (user,product) pairs like in the Mahout Hadoop Job and store the results for online usage...
https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.java
You can use the function .save(sparkContext, outputFolder) to save the model to a folder of your choice. When serving recommendations in real time, you just have to call MatrixFactorizationModel.load(sparkContext, modelFolder) to load it as a MatrixFactorizationModel object.
A question to @Sean Owen: doesn't the MatrixFactorizationModel contain the factorization matrices (the user-feature and item-feature matrices) rather than recommendations or predicted ratings?
