Could you let me know the best practices for conducting performance testing of my PMML application, which is based on the JPMML evaluator and/or our own PMML scoring engine? I found some sample PMML files and corresponding test data at http://dmg.org/pmml/pmml_examples/index.html, but I am looking for very large data (representative of an actual customer transaction dataset). Also, I have come to know that JPMML 1.2.6 is around 10x faster than 1.2 but consumes some extra memory. What are the best practices to verify this on a large dataset (GBs of data)?
I'm new to Spark (and to cluster computing frameworks) and I'm wondering about the general principles followed by the parallel algorithms used for machine learning (MLlib). Are they essentially faster because Spark distributes the training data over multiple nodes? If yes, I suppose that all nodes share the same set of parameters, right? And that they have to combine (e.g. by summing) the intermediate calculations (e.g. the gradients) on a regular basis, am I wrong?
Secondly, suppose I want to fit my data with an ensemble of models (e.g. 10). Wouldn't it be simpler in this particular context to run my good old machine-learning program independently on 10 machines instead of having to write complicated code (for me at least!) for training in a Spark cluster?
Corollary question: is Spark (or another cluster computing framework) useful only for big data applications for which we could not afford to train more than one model and for which the training time would be too long on a single machine?
You're correct about the general principle. A typical MLlib algorithm is an iterative procedure with a local phase and a data exchange phase.
MLlib algorithms are not necessarily faster. They try to solve two problems:
disk latency.
memory limitations on a single machine.
If you can process the data on a single node, this can be orders of magnitude faster than using ML / MLlib.
The last question is hard to answer but:
It is not complicated to train ensembles:
import numpy as np

def train_model(iter):
    # collect this partition locally and fit one model per partition
    items = np.array(list(iter))
    model = ...  # fit any local estimator on `items` here
    return [model]  # mapPartitions expects an iterable to be returned

models = rdd.mapPartitions(train_model).collect()
There are projects which already do that (https://github.com/databricks/spark-sklearn)
In this link - LINK, it is mentioned that a machine learning model which has been constructed offline can be used against streaming data for testing.
Excerpt from the Apache Spark Streaming MLlib link:
" You can also easily use machine learning algorithms provided by MLlib. First of all, there are streaming machine learning algorithms (e.g. Streaming Linear Regression, Streaming KMeans, etc.) which can simultaneously learn from the streaming data as well as apply the model on the streaming data. Beyond these, for a much larger class of machine learning algorithms, you can learn a learning model offline (i.e. using historical data) and then apply the model online on streaming data. See the MLlib guide for more details.
"
Does this mean that one can use a complex learning model like a Random Forest model built in Spark for testing against streaming data in a Spark Streaming program? Is it as simple as referring to the "Model" which has been built and calling "predictOnValues()" over it in the Spark Streaming program?
In this case, would the main difference between the existing Spark Streaming machine learning algorithms and this approach be the fact that the streaming algorithms will evolve over time, while the offline-trained / online-scored approach would keep using the insights from what it had learnt earlier, without any possibility of online learning?
Am I getting this right? Please let me know if my understanding of both the points mentioned above is correct.
Does this mean that one can use a complex learning model like a Random Forest model built in Spark for testing against streaming data in a Spark Streaming program?
Yes, you can train a model like a Random Forest in batch mode and store the model for later predictions. If you want to integrate this with a streaming application where values arrive continuously for prediction, you just need to load the model (which essentially means reading its learned parameters) into memory and keep predicting.
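A minimal sketch of that idea in PySpark (the model path, the `sc` context and the `points` DStream of LabeledPoint are assumptions, not part of the question):

from pyspark.mllib.tree import RandomForestModel

# load the offline-trained model once, on the driver
model = RandomForestModel.load(sc, "hdfs:///models/my-random-forest")

# in PySpark, tree-model predict must be applied to whole RDDs, so use transform
predictions = points.transform(
    lambda rdd: model.predict(rdd.map(lambda lp: lp.features)))
predictions.pprint()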
Is it as simple as referring to the "Model" which has been built and calling "predictOnValues()" over it in the Spark Streaming program?
Yes.
In this case, would the main difference between the existing Spark Streaming machine learning algorithms and this approach be the fact that the streaming algorithms will evolve over time, while the offline-trained / online-scored approach would keep using the insights from what it had learnt earlier, without any possibility of online learning?
Training a model does nothing more than updating the weight vector for the features. You still have to choose alpha (the learning rate) and lambda (the regularisation parameter). So, when you use StreamingLinearRegression (or one of the other streaming equivalents), you will have two DStreams, one for training and the other for prediction, for obvious purposes.
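A minimal sketch of that two-DStream setup in PySpark (the stream names and `numFeatures` are assumptions):

from pyspark.mllib.regression import StreamingLinearRegressionWithSGD

model = StreamingLinearRegressionWithSGD(stepSize=0.1, numIterations=50)
model.setInitialWeights([0.0] * numFeatures)

model.trainOn(trainingStream)   # DStream of LabeledPoint used to keep learning
model.predictOnValues(
    testStream.map(lambda lp: (lp.label, lp.features))).pprint()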
I have a predictive model (Logistic Regression) built in Spark 1.6 that has been saved to disk for later reuse with new data records. I want to invoke it from multiple clients, with each client passing in a single data record. It seems that using a Spark job to run single records through would have way too much overhead and would not be very scalable (each invocation will only pass in a single set of 18 values). The MLlib API to load a saved model requires the Spark Context though, so I am looking for suggestions on how to do this in a scalable way. Spark Streaming with Kafka input comes to mind (each client request would be written to a Kafka topic). Any thoughts on this idea or alternative suggestions?
Non-distributed models (which in practice are the majority) from o.a.s.mllib don't require an active SparkContext for single-item predictions. If you check the API docs you'll see that LogisticRegressionModel provides a predict method with signature Vector => Double. It means you can serialize the model using standard Java tools, read it back later, and perform prediction on a local o.a.s.mllib.Vector object.
Spark also provides limited PMML support (not for logistic regression), so you can share your models with any other library which supports this format.
Finally, non-distributed models are usually not that complex. For linear models all you need is the intercept, the coefficients, some basic math functions and a linear algebra library (if you want decent performance).
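For example, a minimal sketch of scoring a binary logistic regression model outside Spark (it assumes `lr_model` is the trained mllib LogisticRegressionModel and that its weights and intercept were extracted once on the Spark side):

import numpy as np

weights = np.array(lr_model.weights.toArray())   # coefficient vector, exported once
intercept = lr_model.intercept

def predict_local(features):
    # plain logistic regression: sigmoid of the linear margin, no Spark needed
    margin = float(np.dot(weights, features)) + intercept
    return 1.0 / (1.0 + np.exp(-margin))

Each client request then becomes a cheap local function call over its 18 values.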
o.a.s.ml models are slightly harder to handle, but there are some external tools which try to address that. You can check the related discussion on the developers list (Deploying ML Pipeline Model) for details.
For distributed models there is really no good workaround. You'll have to start a full job on a distributed dataset one way or another.
I am using Spark to build a recommendation system prototype. After going through some tutorials, I have been able to train a MatrixFactorizationModel from my data.
However, the model trained by Spark MLlib is just a Serializable. How can I use this model to make recommendations for real users? I mean, how can I persist the model into some sort of database, or update it as new user data comes in?
For example, the model trained by the Mahout recommendation library can be stored in a database like Redis, and then we can query for the recommended item list later. But how can we do similar stuff in Spark? Any suggestions?
First, the "model" you're referring to from Mahout is not a model, but a pre-computed list of recommendations. You could also do this with Spark, and compute in batch recommendations for users, and persist them anywhere you like. This has nothing to do with serializing a model. If you don't want to do real-time updates or scoring, you can stop there and just use Spark for batch just like you do Mahout.
But I agree that in a lot of cases you do want to ship the model somewhere else and serve it. As you can see, other models in Spark are Serializable, but not MatrixFactorizationModel. (Yes, even though it's marked as such, it won't serialize.) Likewise, there is a standard serialization for predictive models called PMML but it contains no vocabulary for a factored matrix model.
The reason is actually the same. Whereas many predictive models, like an SVM or logistic regression model, are just a small set of coefficients, a factored matrix model is huge, containing two matrices with potentially billions of elements. That is why I think PMML doesn't have any reasonable encoding for it.
Likewise, in Spark, that means the actual matrices are RDDs that can't be serialized directly. You can persist these RDDs to storage, re-read them elsewhere using Spark, and recreate a MatrixFactorizationModel by hand that way.
You can't serve or update the model using Spark though. For this you are really looking at writing some code to perform updates and calculate recommendations on the fly.
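As a rough sketch of persisting the factor matrices and then computing a recommendation on the fly outside a Spark job (PySpark; the storage paths are placeholders, and the in-memory `user_factors` / `item_factors` lookups in the serving layer are assumptions):

# persist the two factor matrices of the trained MatrixFactorizationModel
model.userFeatures().saveAsPickleFile("hdfs:///recs/userFeatures")
model.productFeatures().saveAsPickleFile("hdfs:///recs/productFeatures")

# elsewhere, score one (user, product) pair from the exported factors with plain NumPy
import numpy as np
user_vec = np.array(user_factors[user_id])     # loaded from the persisted files or a DB
item_vec = np.array(item_factors[product_id])
predicted_rating = float(np.dot(user_vec, item_vec))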
I don't mind suggesting here the Oryx project, since its point is to manage exactly this aspect, particularly for ALS recommendation. In fact, the Oryx 2 project is based on Spark and although in alpha, already contains the complete pipeline to serialize and serve the output of MatrixFactorizationModel. I don't know if it meets your needs, but may at least be an interesting reference point.
Another method for creating recs with Spark is the search engine method. This is basically a cooccurrence recommender served by Solr or Elasticsearch. Comparing factorized to cooccurrence is beyond this question so I'll just describe the latter.
You feed interactions (user-id,item-id) into Mahout's spark-itemsimilarity. This produces a list of similar items for every item seen in the interaction data. It will come out by default as a csv and so can be stored anywhere. But it needs to be indexed by a search engine.
In any case, when you want to fetch recs you use the user's history as the query, and you get back an ordered list of items as recs.
One benefit of this method is that indicators can be calculated for as many user actions as you want. Any action the user takes that correlates with what you want to recommend can be used. For instance, suppose you want to recommend a purchase but you also record product-views. If you treated product-views the same as purchases you would likely get worse recs (I've tried it). However, if you calculate an indicator for purchases and another (actually cross-cooccurrence) indicator for product-views, they are equally predictive of purchases. This has the effect of increasing the data used for recs. The same type of thing can be done with user locations to blend location information into purchase recs.
You can also bias your recs based on context. If you are in the "electronics" section of a catalog, you may want recs to be skewed towards electronics. Add electronics to the query against the item's "category" metadata field and give it a boost in the query and you have biased recs.
Since all of the biasing and mixing of indicators happens in the query it makes the recs engine easily tuned to multiple contexts while maintaining only one multi-field query made through a search engine. We get scalability from Solr or Elasticsearch.
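A minimal sketch of what such a multi-field query might look like (Elasticsearch query DSL expressed as a Python dict; the index, the field names and the user-history lists are assumptions):

query = {
    "query": {
        "bool": {
            "should": [
                {"terms": {"purchase_indicators": user_purchase_history}},  # cooccurrence indicator
                {"terms": {"view_indicators": user_view_history}},          # cross-cooccurrence indicator
                {"terms": {"category": ["electronics"], "boost": 2.0}}      # context bias
            ]
        }
    }
}
# es.search(index="items", body=query) returns the ordered list of items as recs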
One other benefit of either factorization or the search method is that entirely new users and new history can be used to create recs, whereas the older Mahout recommenders could only recommend to users and interactions known when the job was run.
Descriptions here:
Mahout docs
Slides
Mahout on Spark: What’s New in Recommenders, part 1
Mahout on Spark: What’s New in Recommenders, part 2
Practical Machine Learning ebook
You should run model.predictAll() on a reduced RDD of (user, product) pairs, like in the Mahout Hadoop job, and store the results for online usage...
https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.java
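A minimal sketch of that batch scoring in PySpark (the `users` and `products` RDDs and the output path are assumptions; in practice you would restrict the candidate pairs rather than take the full cartesian product):

candidates = users.cartesian(products)          # reduce this to a sensible candidate set
predictions = model.predictAll(candidates)      # RDD of Rating(user, product, rating)

top_n = (predictions
         .map(lambda r: (r.user, (r.product, r.rating)))
         .groupByKey()
         .mapValues(lambda items: sorted(items, key=lambda x: -x[1])[:10]))
top_n.saveAsTextFile("hdfs:///recs/top10-per-user")   # or write to Redis/HBase for serving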
You can use the .save(sparkContext, outputFolder) function to save the model to a folder of your choice. When serving the recommendations in real time, you just have to use the MatrixFactorizationModel.load(sparkContext, modelFolder) function to load it back as a MatrixFactorizationModel object.
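For instance (PySpark; the path and user id are placeholders):

from pyspark.mllib.recommendation import MatrixFactorizationModel

model.save(sc, "hdfs:///models/als")
loaded = MatrixFactorizationModel.load(sc, "hdfs:///models/als")
loaded.recommendProducts(user_id, 10)   # top-10 recommendations for one user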
A question to @Sean Owen: doesn't the MatrixFactorizationModel object contain the factorization matrices (the user-feature and item-feature matrices) rather than recommendations/predicted ratings?
I'm trying to find out if it is possible to have "incremental training" on data using MLlib in Apache Spark.
My platform is Prediction IO, and it's basically a wrapper for Spark (MLlib), HBase, ElasticSearch and some other RESTful parts.
In my app data "events" are inserted in real-time, but to get updated prediction results I need to "pio train" and "pio deploy". This takes some time and the server goes offline during the redeploy.
I'm trying to figure out if I can do incremental training during the "predict" phase, but cannot find an answer.
I imagine you are using Spark MLlib's ALS model, which performs matrix factorization. The result of the model is two matrices: a user-features matrix and an item-features matrix.
Assuming we receive a stream of data with ratings (or transactions, in the implicit case), a truly (100%) online update of this model would be to update both matrices for each new piece of rating information by triggering a full retrain of the ALS model on the entire data plus the new rating. In this scenario one is limited by the fact that running the entire ALS model is computationally expensive, and since the incoming stream of data can be frequent it would trigger a full retrain too often.
So, knowing this, we can look for alternatives: a single rating should not change the matrices much, and we have optimization approaches which are incremental, for example SGD. There is an interesting (still experimental) library, written for the case of explicit ratings, which does incremental updates for each batch of a DStream:
https://github.com/brkyvz/streaming-matrix-factorization
The idea behind using an incremental approach such as SGD is that as long as one moves along the gradient (of the minimization problem), one is guaranteed to be moving towards a minimum of the error function. So even if we only do an update for the single new rating, touching only the user-feature row for this specific user and the item-feature row for this specific rated item, as long as the update follows the gradient we still move towards the minimum; as an approximation, of course, but towards the minimum nonetheless.
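A minimal sketch of that single-rating SGD update (plain NumPy; `user_factors` and `item_factors` are assumed to be in-memory copies of the two factor matrices, and the learning-rate / regularization values are arbitrary):

import numpy as np

def sgd_update(user_id, item_id, rating, lr=0.01, reg=0.1):
    u = user_factors[user_id]
    v = item_factors[item_id]
    err = rating - np.dot(u, v)              # prediction error for this one rating
    # move both factor rows along the gradient of the regularized squared error
    user_factors[user_id] = u + lr * (err * v - reg * u)
    item_factors[item_id] = v + lr * (err * u - reg * v)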
The other problem comes from Spark itself and the distributed setting: ideally the updates should be done sequentially, for each new incoming rating, but Spark treats the incoming stream as a batch, which is distributed as an RDD, so the update operations are done for the entire batch with no guarantee of sequentiality.
In more detail, if you are using Prediction.IO for example, you could do offline training using the regular built-in train and deploy functions, but if you want online updates you will have to access both matrices for each batch of the stream, run updates using SGD, and then ask for the new model to be deployed. This functionality is of course not in Prediction.IO; you would have to build it on your own.
Interesting notes for SGD updates:
http://stanford.edu/~rezab/classes/cme323/S15/notes/lec14.pdf
For updating your model near-online (I write "near" because, face it, a true online update is impossible), use the fold-in technique (a small sketch of the idea is given at the end of this answer), e.g.:
Online-Updating Regularized Kernel Matrix Factorization Models for Large-Scale Recommender Systems.
Or you can look at the code of:
MyMediaLite
Oryx - a framework built on the Lambda Architecture paradigm. It should have updates with fold-in of new users/items.
This is part of my answer to a similar question where both problems, near-online training and handling new users/items, were mixed.
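A minimal sketch of folding in a brand-new user (plain NumPy; `item_factors`, the new user's `rated_item_ids`/`ratings` and the regularization value are assumptions):

import numpy as np

def fold_in_user(rated_item_ids, ratings, item_factors, reg=0.1):
    # solve a small ridge regression for the new user's latent vector,
    # keeping the item factors fixed (the usual fold-in / ALS half-step)
    V = np.array([item_factors[i] for i in rated_item_ids])   # shape: (n_rated, rank)
    r = np.array(ratings)
    A = V.T @ V + reg * np.eye(V.shape[1])
    return np.linalg.solve(A, V.T @ r)                        # the new user's factor vector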