Best practice for serving a model generated by Spark MLlib - apache-spark

I am trying to find out the proper way to use a model generated by Spark + MLlib (in this case a collaborative filtering recommendation engine) to provide predictions quickly, on demand, as a server.
My current solution is to keep an instance of Spark running continuously for this purpose, but I wanted to know whether there are better solutions, perhaps one that does not require a running Spark instance. Is there a way to load and use a model generated by Spark without involving Spark at all?

You can export a model via PMML and then take that model and use it in another application.

I found a way now. First, we can save the ALS model's product features and user features via model.productFeatures() and model.userFeatures().
A dumped product-feature record then looks like this (an id followed by its feature vector):
209699159874445020
0.0533636957407,-0.0878632888198,0.105949401855,0.129774808884,0.0953511446714,0.16420891881,0.0558457262814,0.0587058141828
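One way to produce such a dump, as a minimal sketch: the output paths and the one-record-per-line, comma-separated layout are my assumptions, and the RDD-based pyspark.mllib MatrixFactorizationModel is assumed as the trained model.
def dump_features(features_rdd, path):
    # each record is (id, array_of_floats); write "id,f1,f2,..." per line
    (features_rdd
        .map(lambda kv: str(kv[0]) + "," + ",".join(str(x) for x in kv[1]))
        .saveAsTextFile(path))

dump_features(model.productFeatures(), "product_features_dump")  # hypothetical paths
dump_features(model.userFeatures(), "user_features_dump")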
We can then load the product features and user features into two lookup structures in Python and build a small Tornado server that predicts ratings from them. Example code:
from scipy.linalg import blas  # scipy's low-level BLAS wrapper (assumed)

def predict(item_id, user_id):
    # look up the row indices, then take the feature columns (column 0 holds the id)
    ind = item_id_index[item_id]
    gf = goods_features[ind, 1:]
    ind = user_id_index[user_id]
    uf = user_features[ind, 1:]
    # ddot(x, y, n, offx, incx, offy, incy): dot product of the two feature vectors
    return blas.ddot(gf, uf, len(gf), 0, 1, 0, 1)
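For completeness, here is one way the lookup structures used above (item_id_index, goods_features, and analogously for users) might be built from the dumped feature files. The local file name is a placeholder, and it assumes the dumped part files have been merged into a single CSV with the id in column 0.
import numpy as np

rows = [line.strip().split(",") for line in open("product_features.csv")]
goods_features = np.array([[float(x) for x in r] for r in rows])  # column 0 = id, rest = features
item_id_index = {int(r[0]): i for i, r in enumerate(rows)}
# user_features / user_id_index are built the same way from the user dump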
In conclusion: we need to persist the ALS model ourselves, and it isn't as difficult as it seems. Any suggestions are welcome.

Related

Spark - The purpose of saving an ALS model

I'm trying to understand what the purpose of storing an ALS model is and what the use case for a stored model would be.
I have a dataset with over 300M rows, and I'm using a Hadoop cluster and Spark to compute recommendations with the ALS algorithm.
The whole computation takes around 5 hours, and I'm wondering what the case would be for storing my model and using it again, say, the next day... and I don't see any. So either I'm doing something wrong (which is possible, given that I'm a beginner in the ML world), or the ALS algorithm in Spark and the ability to save it to disk are not very helpful.
Right now, I use it as follows:
from pyspark.ml.recommendation import ALS

df_input = spark.read.format("avro").load(PATH, schema=SCHEMA)
als = ALS(maxIter=12, regParam=0.05, rank=15, userCol="user", itemCol="item", ratingCol="rating", coldStartStrategy="drop")
model = als.fit(df_input)
df_recommendations = model.recommendForAllUsers(10)
And as I mentioned, df_input is a DataFrame with over 300M rows. The total calculation time is around 5 hours, after which I receive 10 recommended items for each user in the dataset.
In many tutorials and books there is an example of training the model and then validating it with test data, something like:
from pyspark.ml.recommendation import ALS, ALSModel

df_input = spark.read.format("avro").load(PATH, schema=SCHEMA)
(training, test) = df_input.randomSplit(weights=[0.7, 0.3])
als = ALS(maxIter=12, regParam=0.05, rank=15, userCol="user", itemCol="item", ratingCol="rating", coldStartStrategy="drop")
model = als.fit(training)  # the DataFrame-based ALS estimator uses fit(), not train()
model.write().save("saved_model")
...
model = ALSModel.load('saved_model')
predictions = model.transform(test)  # or df_input, to get predictions for every user
I don't see any pros to using it this way. However, I do see one big con: you don't use 30% of the data to train the model.
As far as I know, there isn't a way to use an ALS model online (in real time), at least not without some external package/library.
You can't incrementally update this model.
You can't use it for newly registered users, because they don't exist in the stored matrix factorization, so there won't be any recommendations for them.
All you can do is check what the prediction would be for a given user-item pair, which is basically the same thing that would be returned by the first code example (the one that uses fit()).
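A minimal sketch of that pair-scoring use (the user/item ids are hypothetical; the column names match the ALS configuration above):
from pyspark.ml.recommendation import ALSModel

model = ALSModel.load("saved_model")
# score a handful of known (user, item) pairs; pairs involving users or items unseen
# at training time get NaN predictions (or are dropped, depending on coldStartStrategy)
pairs = spark.createDataFrame([(17, 42), (17, 99)], ["user", "item"])
model.transform(pairs).show()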
So what would be a reason to store this model on disk and load it when needed? Or when (under what conditions) should I consider storing the model and reusing it? Could you provide a use case?

Updating Machine Learning models in DataFrame-based MLlib (PySpark 2.2.0)

I have built a machine learning model based on clustering, and now I just want to update it with new data periodically (on a daily basis). I am using PySpark MLlib and am not able to find any method in Spark for this need.
Note that the required method, partial_fit, is available in scikit-learn but not in Spark.
I am not in favor of appending the new data and then rebuilding the model daily, as that will increase the data size and be computationally expensive.
Please suggest an effective way to do model updates or online learning with Spark MLlib.
You cannot update arbitrary models.
For a few select models this works. For some it works if you accept some loss in accuracy. But for other models, the only way is to rebuild them completely.
Take support vector machines, for example. The model only stores the support vectors. When updating, you would also need all the non-support vectors in order to find the optimal model.
That is why it is fairly common to build new models every night, for example.
Streaming is quite overrated, in particular streaming k-means. It is nonsense to do online k-means with "big" (lol) data: because each new point has next to zero effect, you might just as well do a batch run every night. These are mostly academic toys with little practical relevance.
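For context, the online k-means being referred to here is Spark's own StreamingKMeans, which nudges cluster centers once per micro-batch rather than refitting from scratch. A minimal, hedged sketch of how it is wired up; the input path, dimensionality, and parameters are all placeholders:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.mllib.clustering import StreamingKMeans
from pyspark.mllib.linalg import Vectors

sc = SparkContext(appName="streaming-kmeans-sketch")
ssc = StreamingContext(sc, batchDuration=60)  # one micro-batch per minute

# new training points arrive as comma-separated lines in a landing directory
points = ssc.textFileStream("hdfs:///landing/points").map(
    lambda line: Vectors.dense([float(x) for x in line.split(",")]))

model = StreamingKMeans(k=8, decayFactor=0.9)
model.setRandomCenters(dim=10, weight=1.0, seed=42)
model.trainOn(points)  # centers are updated by each micro-batch

ssc.start()
ssc.awaitTermination()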

Scalable invocation of a Spark MLlib 1.6 predictive model with a single data record

I have a predictive model (logistic regression) built in Spark 1.6 that has been saved to disk for later reuse with new data records. I want to invoke it from multiple clients, with each client passing in a single data record. Using a Spark job to run single records through seems to have far too much overhead and would not be very scalable (each invocation will only pass in a single set of 18 values). The MLlib API for loading a saved model requires the SparkContext, though, so I am looking for suggestions on how to do this in a scalable way. Spark Streaming with Kafka input comes to mind (each client request would be written to a Kafka topic). Any thoughts on this idea, or alternative suggestions?
Non-distributed models (in practice, the majority) from o.a.s.mllib don't require an active SparkContext for single-item predictions. If you check the API docs, you'll see that LogisticRegressionModel provides a predict method with the signature Vector => Double. That means you can serialize the model using standard Java tools, read it back later, and perform predictions on a local o.a.s.mllib.Vector object.
Spark also provides limited PMML support (not for logistic regression), so you can share your models with any other library that supports this format.
Finally, non-distributed models are usually not that complex. For linear models all you need is the intercept, the coefficients, some basic math functions, and a linear algebra library (if you want decent performance).
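As a minimal sketch of that point, in plain numpy: the coefficients and intercept are assumed to have been exported from the trained LogisticRegressionModel once, offline.
import numpy as np

def predict_probability(features, coefficients, intercept):
    # logistic regression scoring without Spark: sigmoid of the linear margin
    margin = float(np.dot(coefficients, features)) + intercept
    return 1.0 / (1.0 + np.exp(-margin))

# e.g. coefficients = np.array(model.weights.toArray()); intercept = model.intercept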
o.a.s.ml models are slightly harder to handle, but there are some external tools which try to address that. You can check the related discussion on the developers list (Deploying ML Pipeline Model) for details.
For distributed models there is really no good workaround. You'll have to start a full job on a distributed dataset one way or another.

Is there a way to visualize a Spark MLlib Random Forest model?

I can't seem to find a way to visualize my RF model, obtained using Spark MLlib's RandomForestModel. The model, printed as a string, is just a bunch of nested IF statements. It seems natural to want to visualize it the way you can in R. I am using the Spark Python API and the Java API, and I'm open to anything that will produce an R-like visualization of my RF model.
There is a library out there to help with this: EurekaTrees. Basically it just takes the debug string, builds a tree, and then displays it as a web page using d3.js.
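For reference, the debug string such a tool consumes can be pulled straight off the trained model; a small sketch, where the trained pyspark.mllib.tree.RandomForestModel (here called model) and the output file name are assumed:
# write the nested-IF representation of the forest to a file for external tooling
with open("rf_debug_string.txt", "w") as f:
    f.write(model.toDebugString())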
From Databricks (Oct 2015):
"The plots listed above as Scala-only will soon be available in Python notebooks as well. There are also other machine learning model visualizations on the way. Stay tuned for Decision Tree and Machine Learning Pipeline visualizations!"

Spark - How to use the trained recommender model in production?

I am using Spark to build a recommendation system prototype. After going through some tutorials, I have been able to train a MatrixFactorizationModel from my data.
However, the model trained by Spark MLlib is just Serializable. How can I use this model to make recommendations for real users? I mean, how can I persist the model into some sort of database, or update it when the user data grows?
For example, the model trained by the Mahout recommendation library can be stored in a database like Redis, and then we can query the recommended item list later. But how can we do similar things in Spark? Any suggestions?
First, the "model" you're referring to from Mahout is not a model, but a pre-computed list of recommendations. You could also do this with Spark, computing recommendations for users in batch and persisting them anywhere you like. This has nothing to do with serializing a model. If you don't want to do real-time updates or scoring, you can stop there and just use Spark for batch, just like you do with Mahout.
But I agree that in a lot of cases you do want to ship the model somewhere else and serve it. As you can see, other models in Spark are Serializable, but not MatrixFactorizationModel. (Yes, even though it's marked as such, it won't serialize.) Likewise, there is a standard serialization for predictive models called PMML, but it contains no vocabulary for a factored matrix model.
The reason is actually the same. Whereas many predictive models, like an SVM or logistic regression model, are just a small set of coefficients, a factored matrix model is huge, containing two matrices with potentially billions of elements. That is why I think PMML doesn't have any reasonable encoding for it.
Likewise, in Spark, that means the actual matrices are RDDs that can't be serialized directly. You can persist these RDDs to storage, re-read them elsewhere using Spark, and recreate a MatrixFactorizationModel by hand that way.
You can't serve or update the model using Spark though. For this you are really looking at writing some code to perform updates and calculate recommendations on the fly.
I don't mind suggesting here the Oryx project, since its point is to manage exactly this aspect, particularly for ALS recommendation. In fact, the Oryx 2 project is based on Spark and although in alpha, already contains the complete pipeline to serialize and serve the output of MatrixFactorizationModel. I don't know if it meets your needs, but may at least be an interesting reference point.
Another method for creating recs with Spark is the search engine method. This is basically a cooccurrence recommender served by Solr or Elasticsearch. Comparing factorized to cooccurrence is beyond this question so I'll just describe the latter.
You feed interactions (user-id, item-id) into Mahout's spark-itemsimilarity. This produces a list of similar items for every item seen in the interaction data. It comes out by default as a CSV, so it can be stored anywhere, but it needs to be indexed by a search engine.
In any case, when you want to fetch recs you use the user's history as the query, and you get back an ordered list of items as recs.
One benefit of this method is that indicators can be calculated for as many user actions as you want. Any action the user takes that correlates with what you want to recommend can be used. For instance, suppose you want to recommend purchases but you also record product views. If you treated product views the same as purchases, you would likely get worse recs (I've tried it). However, if you calculate one indicator for purchases and another (actually a cross-cooccurrence) indicator for product views, they are equally predictive of purchases. This has the effect of increasing the data used for recs. The same type of thing can be done with user locations to blend location information into purchase recs.
You can also bias your recs based on context. If you are in the "electronics" section of a catalog, you may want recs skewed towards electronics. Add electronics to the query against the item's "category" metadata field, give it a boost in the query, and you have biased recs.
Since all of the biasing and mixing of indicators happens in the query, the recs engine is easily tuned to multiple contexts while maintaining only one multi-field query made through a search engine. We get scalability from Solr or Elasticsearch.
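A hedged sketch of the kind of biased, multi-field query described above, expressed as an Elasticsearch-style query body built in Python; all field names, the boost value, and the example item ids are hypothetical:
user_history = ["sku123", "sku456"]  # items from the user's purchase history
biased_query = {
    "query": {
        "bool": {
            "should": [
                # match the user's history against the cooccurrence indicator field
                {"terms": {"purchase_indicators": user_history}},
                # skew results toward the catalog section the user is browsing
                {"term": {"category": {"value": "electronics", "boost": 2.0}}},
            ]
        }
    },
    "size": 10,
}
# biased_query would then be POSTed to the search engine's /<index>/_search endpoint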
One other benefit of either factorization or the search method is that entirely new users and new history can be used to create recs, whereas the older Mahout recommenders could only recommend to users and interactions known at the time the job was run.
Descriptions here:
Mahout docs
Slides
Mahout on Spark: What’s New in Recommenders, part 1
Mahout on Spark: What’s New in Recommenders, part 2
Practical Machine Learning ebook
You should run model.predictAll() on a reduced RDD of (user, product) pairs, as in the Mahout Hadoop RecommenderJob, and store the results for online usage:
https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.java
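A minimal sketch of that batch-scoring idea, assuming an already-trained pyspark.mllib MatrixFactorizationModel (model) and a hypothetical candidate-pair RDD and output path:
candidate_pairs = sc.parallelize([(1, 10), (1, 11), (2, 10)])  # (user, product)
scored = model.predictAll(candidate_pairs)                     # RDD of Rating(user, product, rating)
top_10 = (scored.map(lambda r: (r.user, (r.product, r.rating)))
                .groupByKey()
                .mapValues(lambda items: sorted(items, key=lambda pr: -pr[1])[:10]))
top_10.saveAsTextFile("precomputed_top10")                     # serve these results online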
You can use model.save(sparkContext, outputFolder) to save the model to a folder of your choice. When serving recommendations in real time, you just have to call MatrixFactorizationModel.load(sparkContext, modelFolder) to load it back as a MatrixFactorizationModel object.
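In code form (the folder name and user id are placeholders):
from pyspark.mllib.recommendation import MatrixFactorizationModel

model.save(sc, "als_model_folder")
loaded = MatrixFactorizationModel.load(sc, "als_model_folder")
print(loaded.recommendProducts(1, 10))  # top-10 items for user 1, computed from the loaded factors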
A question to @Sean Owen: doesn't the MatrixFactorizationModel object contain the factorization matrices (the user-feature and item-feature matrices) rather than recommendations/predicted ratings?
