Input for Apache Spark MLlib - apache-spark

What kind of input can I give my ML algorithms from Spark?
There are streaming algorithms like StreamingKMeans or StreamingLinearRegression.
These can have an input stream for training and testing.
In addition, there are many other algorithms, such as ALS or Decision Tree, which in the examples on the Spark website have only been trained and tested with static datasets.
My question is whether I can feed a streaming dataset to algorithms that are not designed for streaming.
For example:
https://spark.apache.org/docs/latest/ml-collaborative-filtering.html. This example only reads from static files. Can I use a streaming input for this algorithm?

Related

What's the difference between Spark ML and SystemML?

What's the difference between Spark ML and SystemML?
There are problems solved by both SystemML and Spark ML on the Apache Spark engine on IBM; I want to know what the main difference is.
Apache Spark is a distributed, data-parallel framework with rich primitives such as map, reduce, join, filter, etc. Additionally, it powers a stack of “libraries” including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming.
Apache SystemML is a flexible, scalable machine learning (ML) system, enabling algorithm customization and automatic optimization. SystemML’s distinguishing characteristics are:
Algorithm customizability via R-like and Python-like languages.
Multiple execution modes, including Spark MLContext (sketched below), Spark Batch, Hadoop Batch, Standalone, and JMLC.
Automatic optimization based on data and cluster characteristics to ensure both efficiency and scalability.
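To make the first two points concrete, here is a minimal sketch of the Spark MLContext mode, following SystemML's MLContext programming guide; the tiny DML script and app name are illustrative, and the exact API should be checked against your SystemML version:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.sysml.api.mlcontext.MLContext
import org.apache.sysml.api.mlcontext.ScriptFactory.dml

// Sketch of SystemML's Spark MLContext mode: an R-like DML script is
// handed to SystemML, which compiles and optimizes it at runtime.
val spark = SparkSession.builder.appName("systemml-sketch").master("local[*]").getOrCreate()
val ml = new MLContext(spark)

// Toy DML script: build a 3x3 matrix of ones and sum its entries.
val script = dml("X = matrix(1, rows=3, cols=3); s = sum(X);").out("s")
val sum = ml.execute(script).getDouble("s")  // expected: 9.0
println(sum)
```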
A more useful comparison would be Spark’s MLLib and SystemML:
Like MLLib, SystemML can run on top of Apache Spark [batch mode or programmatic API].
But unlike MLLib (which has a fixed runtime plan), SystemML has an optimizer that adapts the runtime plan based on the input data and cluster characteristics.
Both MLLib and SystemML can accept the input data as Spark’s DataFrame.
MLLib's algorithms are written in Scala using Spark's primitives. At a high level, there are two kinds of MLLib users: (1) expert developers who implement their algorithms in Scala and have a deep understanding of Spark's core, and (2) non-expert data scientists who want to use MLLib as a black box and tweak the hyperparameters (see the sketch after this list). Both kinds of users rely heavily on initial assumptions about the input data and cluster characteristics. If those assumptions are not valid for a given use case in production, the user can get poor performance or even an OOM error.
SystemML's algorithms are implemented in a high-level (linear-algebra-friendly) language, and its optimizer dynamically compiles the runtime plan based on the input data and cluster characteristics. To simplify usage, SystemML comes with a bunch of pre-implemented algorithms along with MLLib-like wrappers.
Unlike MLLib, SystemML's algorithms can also run on other backends, such as Apache Hadoop, embedded in-memory, GPU, and possibly Apache Flink in the future.
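As a rough illustration of the "black-box" MLlib usage described above, here is a minimal sketch in the style of the Spark ML documentation examples; the toy data and hyperparameter values are purely illustrative:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

// Minimal "black-box" MLlib usage: supply a DataFrame, tweak hyperparameters,
// and let MLlib's fixed runtime plan do the rest.
val spark = SparkSession.builder.appName("mllib-blackbox").master("local[*]").getOrCreate()
import spark.implicits._

val training = Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (1.0, Vectors.dense(0.0, 1.3, 1.0))
).toDF("label", "features")

val lr = new LogisticRegression()
  .setMaxIter(10)      // hyperparameters the non-expert user tunes...
  .setRegParam(0.01)   // ...without touching Spark's internals

val model = lr.fit(training)
model.transform(training).select("label", "prediction").show()
```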
Examples of machine learning systems with a cost-based optimizer (similar to SystemML): Mahout Samsara, Tupleware, Cumulon, Dmac and SimSQL.
Examples of machine learning libraries with a fixed plan (similar to MLLib): Mahout MR, MADlib, ORE, Revolution R and HP's Distributed R.
Examples of distributed systems with domain-specific languages (similar to Apache Spark): Apache Flink, REEF and GraphLab.

Explain the connection between Spark libraries, such as Spark SQL, MLlib, GraphX and Spark Streaming

Explain the connection between libraries such as Spark SQL, MLlib, GraphX and Spark Streaming, and the core Spark platform.
Basically, Spark is the base: an engine that allows large-scale data processing with high performance. It provides an interface for programming with implicit data parallelism and fault tolerance.
GraphX, MLlib, Spark Streaming and Spark SQL are modules built on top of this engine, and each of them has a different goal. Each of these libraries introduces new objects and functions that provide support for certain types of structures or features.
For example:
GraphX is a distributed graph processing module which lets you represent a graph and apply efficient transformations, partitioning strategies and algorithms specialized for this kind of structure.
MLlib is a distributed machine learning module on top of Spark which implements algorithms for tasks like classification, regression, clustering, ...
Spark SQL introduces the notion of DataFrames, the most important structure in this module, which allows applying SQL-like operations (e.g. select, where, groupBy, ...).
Spark Streaming is an extension of core Spark which ingests data in mini-batches and performs transformations on those mini-batches of data. Spark Streaming has built-in support for consuming from Kafka, Flume and other platforms.
You can combine these modules according to your needs. For example, if you want to process a large graph by applying a clustering algorithm, then you can use the representation provided by GraphX and use MLlib to apply K-means to this representation.
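To make that combination concrete, here is a minimal sketch assuming Spark's GraphX and ML APIs; the toy edge list and the choice of vertex degree as the (one-dimensional) feature are purely illustrative:

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

// Build a small graph with GraphX, derive a trivial per-vertex feature
// (its degree), and cluster the vertices with MLlib's K-means.
val spark = SparkSession.builder.appName("graphx-plus-mllib").master("local[*]").getOrCreate()
import spark.implicits._

val edges = spark.sparkContext.parallelize(Seq(
  Edge(1L, 2L, 1.0), Edge(2L, 3L, 1.0), Edge(4L, 5L, 1.0), Edge(5L, 6L, 1.0)
))
val graph = Graph.fromEdges(edges, defaultValue = 0)

// One-dimensional feature vector per vertex: its degree.
val features = graph.degrees
  .map { case (vertexId, degree) => (vertexId, Vectors.dense(degree.toDouble)) }
  .toDF("id", "features")

val model = new KMeans().setK(2).setSeed(42L).fit(features)
model.transform(features).show()  // adds a "prediction" column with the cluster id
```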

Can Spark and the ScalaNLP library Breeze be used together?

I'm developing a Scala-based extreme learning machine in Apache Spark. My model has to be a Spark Estimator and use the Spark framework in order to fit into the machine learning pipeline. Does anyone know if Breeze can be used in tandem with Spark? All of my data is in Spark DataFrames; conceivably I could import it using Breeze, use Breeze DenseVectors as the data structure, then convert to a DataFrame for the Estimator part. The advantage of Breeze is that it has a function pinv for the Moore-Penrose pseudo-inverse, which is an inverse for a non-square matrix. There is no equivalent function in Spark MLlib, as far as I can see. I have no idea whether it's possible to convert Breeze tensors to Spark DataFrames, so if anyone has experience of this it would be really useful. Thanks!
Breeze can be used with Spark. In fact, it is used internally by many MLlib functions, but the required conversions are not exposed as public. You can add your own conversions and use Breeze to process individual records.
For example, for Vectors you can find the conversion code in:
SparseVector.asBreeze
DenseVector.asBreeze
Vector.fromBreeze
For Matrices please see asBreeze / fromBreeze in Matrices.scala
It cannot, however, be used on distributed data structures. Breeze objects use low-level libraries which cannot be used for distributed processing. Therefore, DataFrame-to-Breeze conversions are possible only if you collect the data to the driver, and they are limited to scenarios where the data can be stored in driver memory.
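Since the built-in conversions are private, a hand-rolled, driver-side sketch might look like the following; pinv here addresses the pseudo-inverse use case from the question and, per the caveat above, assumes the data fits in driver memory:

```scala
import breeze.linalg.{pinv, DenseMatrix => BDM, DenseVector => BDV}
import org.apache.spark.ml.linalg.{Vectors, Vector => SparkVector}

// Hand-rolled conversions between Spark ML vectors and Breeze, since
// Spark's own asBreeze / fromBreeze helpers are private[spark].
// Note: toArray densifies sparse vectors, which is fine for a sketch.
object BreezeConversions {
  def toBreeze(v: SparkVector): BDV[Double] = BDV(v.toArray)
  def fromBreeze(v: BDV[Double]): SparkVector = Vectors.dense(v.toArray)
}

// Driver-side use of Breeze's pinv (Moore-Penrose pseudo-inverse) on a
// non-square matrix -- the data must fit in driver memory.
val m = BDM((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))  // 3x2 matrix
val mPinv: BDM[Double] = pinv(m)                 // 2x3 pseudo-inverse
```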
There exist other libraries, like SystemML, which integrate with Spark and provide more comprehensive linear algebra routines on distributed objects.

Real-time data standardization / normalization with Spark structured streaming

Standardizing / normalizing data is an essential, if not crucial, step when it comes to implementing machine learning algorithms. Doing so in a real-time manner using Spark Structured Streaming is a problem I've been trying to tackle for the past couple of weeks.
Using the StandardScaler estimator ((value(i) - mean) / standard deviation) on historical data proved to be great, and in my use case it is the best way to get reasonable clustering results, but I'm not sure how to fit a StandardScaler model on real-time data. Structured Streaming does not allow it. Any advice would be highly appreciated!
In other words, how do you fit models in Spark Structured Streaming?
I got an answer for this. It's not possible at the moment to do real-time machine learning with Spark Structured Streaming, including normalization; however, for some algorithms, making real-time predictions is possible if an offline model was built/fitted.
Check:
JIRA - Add support for Structured Streaming to the ML Pipeline API
Google DOC - Machine Learning on Structured Streaming
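For illustration, a minimal sketch of the offline-fit pattern described above; the file paths and column names are hypothetical, and the fit itself must run on a static (batch) DataFrame:

```scala
import org.apache.spark.ml.feature.{StandardScaler, StandardScalerModel}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("offline-scaler").master("local[*]").getOrCreate()

// Fit the scaler on historical (batch) data -- this step cannot run on a
// stream; streaming support for the ML Pipeline API is what the JIRA above tracks.
val historical = spark.read.parquet("historical_features.parquet")  // hypothetical path
val scalerModel: StandardScalerModel = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")
  .setWithMean(true)
  .setWithStd(true)
  .fit(historical)

scalerModel.write.overwrite().save("models/standard-scaler")

// Later, reload the fitted model and apply it to new data.
val reloaded = StandardScalerModel.load("models/standard-scaler")
reloaded.transform(historical).select("features", "scaledFeatures").show()
```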

Spark Streaming - Can an offline model be used against a data stream

In this link - LINK, it is mentioned that a machine learning model which has been constructed offline can be used against streaming data for testing.
Excerpt from the Apache Spark Streaming MLlib link:
" You can also easily use machine learning algorithms provided by MLlib. First of all, there are streaming machine learning algorithms (e.g. Streaming Linear Regression, Streaming KMeans, etc.) which can simultaneously learn from the streaming data as well as apply the model on the streaming data. Beyond these, for a much larger class of machine learning algorithms, you can learn a learning model offline (i.e. using historical data) and then apply the model online on streaming data. See the MLlib guide for more details.
"
Does this mean that one can use a complex learning model, like a Random Forest model built in Spark, for testing against streaming data in a Spark Streaming program? Is it as simple as referring to the "Model" which has been built and calling predictOnValues() on it in the Spark Streaming program?
In this case, would the main difference between the existing Spark Streaming machine learning algorithms and this approach be the fact that the streaming algorithms evolve over time, whereas the offline-model-against-online-stream approach would keep using the insights from what it had learnt earlier, without any possibility of online learning?
Am I getting this right? Please let me know if my understanding of both the points mentioned above is correct.
Does this mean that one can use a complex learning model, like a Random Forest model built in Spark, for testing against streaming data in a Spark Streaming program?
Yes, you can train a model like a Random Forest in batch mode and store the model for predictions later. If you want to integrate this with a streaming application where values arrive continuously for prediction, you just need to load the model (which actually reads the feature vector and its weights) into memory and keep predicting until the end.
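As a rough sketch of that load-and-predict pattern with the RDD-based MLlib API (the model path, socket source and comma-separated feature format are assumptions, not part of the original answer):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("offline-model-on-stream").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5))

// Load a Random Forest trained and saved offline, e.g. via model.save(sc, path).
val model = RandomForestModel.load(ssc.sparkContext, "models/random-forest")

// Each incoming line is assumed to be a comma-separated feature vector.
val predictions = ssc.socketTextStream("localhost", 9999)
  .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
  .map(features => (features, model.predict(features)))

predictions.print()
ssc.start()
ssc.awaitTermination()
```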
Is it as simple as referring to the "Model" which has been built and calling predictOnValues() on it in the Spark Streaming program?
Yes.
In this case, would the main difference between the existing Spark Streaming machine learning algorithms and this approach be the fact that the streaming algorithms evolve over time, whereas the offline-model-against-online-stream approach would keep using the insights from what it had learnt earlier, without any possibility of online learning?
Training a model does nothing more than updating the weight vector for the features. You still have to choose alpha (the learning rate) and lambda (the regularisation parameter). So, when you use StreamingLinearRegression (or another streaming equivalent), you will have two DStreams, one for training and the other for prediction, for obvious purposes; a sketch follows.
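For comparison, here is a minimal sketch of the two-DStream pattern with StreamingLinearRegressionWithSGD, close to the example in the Spark documentation; the socket sources and the "label,f1 f2 f3" line format are assumptions:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("streaming-lr").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))

// Assumed toy line format: "label,f1 f2 f3"
def parse(line: String): LabeledPoint = {
  val Array(label, feats) = line.split(',')
  LabeledPoint(label.toDouble, Vectors.dense(feats.split(' ').map(_.toDouble)))
}

val trainingStream = ssc.socketTextStream("localhost", 9998).map(parse)
val testStream     = ssc.socketTextStream("localhost", 9999).map(parse)

val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(3))  // three features in the toy format
  .setStepSize(0.1)                     // the learning rate ("alpha") mentioned above

model.trainOn(trainingStream)  // the model keeps evolving as data arrives
model.predictOnValues(testStream.map(lp => (lp.label, lp.features))).print()

ssc.start()
ssc.awaitTermination()
```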
