Spark Streaming - Can an offline model be used against a data stream - apache-spark

In this link - LINK, it is mentioned that a machine learning model which has been constructed offline can be used against streaming data for testing.
Excerpt from the Apache Spark Streaming MLlib link:
" You can also easily use machine learning algorithms provided by MLlib. First of all, there are streaming machine learning algorithms (e.g. Streaming Linear Regression, Streaming KMeans, etc.) which can simultaneously learn from the streaming data as well as apply the model on the streaming data. Beyond these, for a much larger class of machine learning algorithms, you can learn a learning model offline (i.e. using historical data) and then apply the model online on streaming data. See the MLlib guide for more details.
"
Does this mean that one can use a complex learning model like Random Forest model built in Spark for testing against streaming data in Spark Streaming program? Is it as simple as referring to the "Model" which has been built and calling "predictOnValues()" over it in Spark Streaming program?
In this case, would the main difference between the existing spark streaming machine learning algorithms (AND) this approach be the fact that the streaming algorithms will evolve over time and the offline(against)online stream approach would still be using the insights from what it had learnt earlier without any possibility of online learning?
Am I getting this right? Please let me know if my understanding for both the points mentioned above is correct.

Does this mean that one can use a complex learning model like Random Forest model built in Spark for testing against streaming data in Spark Streaming program?
Yes, you can train a model like Random Forest in batch mode and store the model for predictions later. In case you want to integrate this with a streaming application where values are coming continuously for prediction you just need to load the model(which actually reads the feature vector and its weight) in memory and do prediction till the end.
Is it as simple as referring to the "Model" which has been built and calling "predictOnValues()" over it in Spark Streaming program?
Yes.
In this case, would the main difference between the existing spark streaming machine learning algorithms (AND) this approach be the fact that the streaming algorithms will evolve over time and the offline(against)online stream approach would still be using the insights from what it had learnt earlier without any possibility of online learning?
Training a model does nothing more than updating weight vector for features. You still have to choose alpha(learning rate) and lambda(regularisation parameter). So, when you will be using StreamingLinearRegression (or other streaming equivalents) you will have two dStreams one for training and other for prediction for obvious purposes.

Related

Does H2O Sparkling Water allow for Online-Training with Kafka as Streaming Source

I'm currently experimenting with the possibilities of Sparkling-Water. There are a few possible Use-Cases including Data-Munging in H2O/Spark, Model Building and Offline-Training and Online Stream Prediction. I was wondering whether it is also possible to use Sparkling-Water for Online-Training together with a Kafka Streaming Source?
The Deep Learning model in particular can continuously train forever if you keep presenting new data. So you could do online training with that.
Models like DRM and GBM can “add another tree” from new data using a checkpoint, although you really don’t want to end up with infinity trees.
You could keep around a window of data and periodically train a new complete model. (Swapping in a new model instance at runtime is pretty straighforward. So you could just keep training in the background and update the model that predicts on streaming data periodically — like every hour or every few minutes, or whatever).
Or do your own ensembling by averaging the prediction of many models — by periodically throwing away older models and adding newer ones in a conveyor-belt type of strategy. Similar to a moving average.

Can Spark and the ScalaNLP library Breeze be used together?

I'm developing a Scala-based extreme learning machine, in Apache Spark. My model has to be a Spark Estimator and use the Spark framework in order to fit into the machine learning pipeline. Does anyone know if Breeze can be used in tandem with Spark? All of my data is in Spark data frames and conceivably I could import it using Breeze, use Breeze DenseVectors as the data structure then convert to a DataFrame for the Estimator part. The advantage of Breeze is that it has a function pinv for the Moore-Penrose pseudo-inverse, which is an inverse for a non-square matrix. There is no equivalent function in the Spark MLlib, as far as I can see. I have no idea whether it's possible to convert Breeze tensors to Spark DataFrames so if anyone has experience of this it would be really useful. Thanks!
Breeze can be used with Spark. In fact is used internally for many MLLib functions, but required conversions are not exposed as public. You can add your own conversions and use Breeze to process individual records.
For example for Vectors you can find conversion code:
SparseVector.asBreeze
DenseVector.asBreeze
Vector.fromBreeze
For Matrices please see asBreeze / fromBreeze in Matrices.scala
It cannot however, be used on distributed data structures. Breeze objects use low level libraries, which cannot be used for distributed processing. Therefore DataFrame - Breeze objects conversions are possible only if you collect data to the driver and are limited to the scenarios where data can be stored in the driver memory.
There exist other libraries, like SysteML, which integrate with Spark and provide more comprehensive linear algebra routines on distributed objects.

Real-time data standardization / normalization with Spark structured streaming

Standardizing / normalizing data is an essential, if not a crucial, point when it comes to implementing machine learning algorithms. Doing so on a real time manner using Spark structured streaming has been a problem I've been trying to tackle for the past couple of weeks.
Using the StandardScaler estimator ((value(i)-mean) /standard deviation) on historical data proved to be great, and in my use case it is the best, to get reasonable clustering results, but I'm not sure how to fit StandardScaler model with real-time data. Structured streaming does not allow it. Any advice would be highly appreciated!
In other words, how to fit models in Spark structured streaming?
I got an answer for this. It's not possible at the moment to do real time machine learning with Spark structured streaming, inluding normalization; however, for some algorithms making real time predictions is possible if an offline model was built/fitted.
Check:
JIRA - Add support for Structured Streaming to the ML Pipeline API
Google DOC - Machine Learning on Structured Streaming

General principles behind Spark MLlib parallelism

I'm new to Spark (and to cluster computing framework) and I'm wondering about the general principles followed by the parallel algorithms used for machine learning (MLlib). Are they essentially faster because Spark distributes training data over multiple nodes? If yes, I suppose that all nodes share the same set of parameters right? And that they have to combine (ex: summing) the intermediate calculations (ex: the gradients) on a regular basis, am I wrong?
Secondly, suppose I want to fit my data with an ensemble of models (ex: 10). Wouldn't it be simpler in this particular context to run my good old machine-learning program independently on 10 machines instead of having to write complicated code (for me at least!) for training in a Spark cluster?
Corollary question: is Spark (or other cluster computing framework) useful only for big data applications for which we could not afford training more than one model and for which training time would be too much long on a single machine?
You correct about the general principle. Typical MLlib algorithm is a an iterative procedure with local phase and data exchange.
MLlib algorithms are not necessarily faster. They try to solve two problems:
disk latency.
memory limitations on a single machine.
If you can process data on a single node this can be orders of magnitude faster than using ML / MLlib.
The last question is hard to answer but:
It is not complicated to train ensembles:
def train_model(iter):
items = np.array(list(iter))
model = ...
return model
rdd.mapPartitions(train_model)
There are projects which already do that (https://github.com/databricks/spark-sklearn)

Scalable invocation of Spark MLlib 1.6 predictive model w/a single data record

I have a predictive model (Logistic Regression) built in Spark 1.6 that has been saved to disk for later reuse with new data records. I want to invoke it with multiple clients with each client passing in single data record. It seems that using a Spark job to run single records through would have way too much overhead and would not be very scalable (each invocation will only pass in a single set of 18 values). The MLlib API to load a saved model requires the Spark Context though so am looking for suggestions of how to do this in a scalable way. Spark Streaming with Kafka input comes to mind (each client request would be written to a Kafka topic). Any thoughts on this idea or alternative suggestions ?
Non-distributed (in practice it is majority) models from o.a.s.mllib don't require an active SparkContext for single item predictions. If you check API docs you'll see that LogisticRegressionModel provides predict method with signature Vector => Double. It means you can serialize model using standard Java tools, read it later and perform prediction on local o.a.s.mllib.Vector object.
Spark also provides a limited PMML support (not for logistic regression) so you share your models with any other library which supports this format.
Finally non-distributed models are usually not so complex. For linear models all you need is intercept, coefficients and some basic math functions and linear algebra library (if you want a decent performance).
o.a.s.ml models are slightly harder to handle but there are some external tools which try to address that. You can check related discussion on the developers list, (Deploying ML Pipeline Model) for details.
For distributed models there is really no good workaround. You'll have to start a full job on distributed dataset one way or another.

Resources