What's the difference between sparkML and systemML? - apache-spark

What's the difference between Spark ML and SystemML?
The same problem can be solved with both SystemML and Spark ML on the Apache Spark engine on IBM; I want to know what the main difference is.

Apache Spark is a distributed, data-parallel framework with rich primitives such as map, reduce, join, filter, etc. Additionally, it powers a stack of “libraries” including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming.
Apache SystemML is a flexible, scalable machine learning (ML) system, enabling algorithm customization and automatic optimization. SystemML’s distinguishing characteristics are:
Algorithm customizability via R-like and Python-like languages.
Multiple execution modes, including Spark MLContext, Spark Batch, Hadoop Batch, Standalone, and JMLC.
Automatic optimization based on data and cluster characteristics to ensure both efficiency and scalability.
A more useful comparison is between Spark's MLlib and SystemML:
Like MLlib, SystemML can run on top of Apache Spark, either in batch mode or through its programmatic API (a small sketch follows this list).
But unlike MLlib (which has a fixed runtime plan), SystemML has an optimizer that adapts the runtime plan to the input data and cluster characteristics.
Both MLlib and SystemML can accept the input data as a Spark DataFrame.
MLlib's algorithms are written in Scala using Spark's primitives. At a high level, there are two kinds of MLlib users: (1) expert developers who implement their algorithms in Scala and have a deep understanding of Spark's core, and (2) non-expert data scientists who want to use MLlib as a black box and tweak the hyperparameters. Both rely heavily on initial assumptions about the input data and cluster characteristics. If those assumptions do not hold for a given production use case, the user can get poor performance or even run out of memory (OOM).
SystemML's algorithms are implemented in a high-level (linear-algebra-friendly) language, and its optimizer dynamically compiles the runtime plan based on the input data and cluster characteristics. To simplify usage, SystemML comes with a bunch of pre-implemented algorithms along with MLlib-like wrappers.
Unlike MLlib, SystemML's algorithms can be used on other backends, such as Apache Hadoop, embedded in-memory, GPU, and maybe in the future Apache Flink.
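For illustration, here is a minimal Scala sketch of the programmatic Spark MLContext API mentioned above, following the pattern in SystemML's MLContext documentation; the SparkSession spark, the DataFrame df and the toy DML script are assumptions for this example, not a shipped algorithm.

import org.apache.sysml.api.mlcontext.MLContext
import org.apache.sysml.api.mlcontext.ScriptFactory.dml

// Assumed to exist already: a SparkSession `spark` and a numeric DataFrame `df`.
val ml = new MLContext(spark)

// A toy DML (R-like) script; SystemML's optimizer compiles it into a runtime
// plan chosen from the data and cluster characteristics at execution time.
val script = dml("s = sum(M)").in("M", df).out("s")
val s = ml.execute(script).getDouble("s")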
Examples of machine learning systems with a cost-based optimizer (similar to SystemML): Mahout Samsara, Tupleware, Cumulon, Dmac and SimSQL.
Examples of machine learning libraries with a fixed plan (similar to MLlib): Mahout MR, MADlib, ORE, Revolution R and HP's Distributed R.
Examples of distributed systems with domain specific languages (similar to Apache Spark): Apache Flink, REEF and GraphLab.

Related

Run a Weka classification algorithm on Spark

Checking Spark MLlib, I see that only a limited set of classification algorithms is included.
Is it possible to add the Weka jar file to Spark and use its classification algorithms on Spark workers?
I know there is no parallelization benefit in doing this, but I am curious to know whether it can be done.
HamidReza,
You cannot really use the Weka jars directly to distribute your learning algorithms; their implementations are not compatible with distributed systems.
Nevertheless, there is a dedicated project named distributedWekaSparkDev, though I haven't tried it yet.
You can learn more about it here: http://markahall.blogspot.com/2017/07/integrating-spark-mllib-into-weka.html

Explain the connection between Spark libraries, such as Spark SQL, MLlib, GraphX and Spark Streaming

Explain the connection between libraries such as Spark SQL, MLlib, GraphX and Spark Streaming, and the core Spark platform.
Basically, Spark is the base: an engine that allows large-scale data processing with high performance. It provides an interface for programming with implicit data parallelism and fault tolerance.
GraphX, MLlib, Spark Streaming and Spark SQL are modules built on top of this engine, and each of them has a different goal. Each of these libraries introduces new objects and functions that provide support for certain types of structures or features.
For example:
GraphX is a distributed graph-processing module that lets you represent a graph and apply efficient transformations, partitioning schemes and algorithms specialized for this kind of structure.
MLlib is a distributed machine learning module on top of Spark that implements common algorithms such as classification, regression, clustering, ...
Spark SQL introduces the notion of DataFrames, the most important structure in this module, which allow applying SQL-style operations (e.g. select, where, groupBy, ...).
Spark Streaming is an extension of core Spark that ingests data in mini-batches and performs transformations on those mini-batches. Spark Streaming has built-in support for consuming from Kafka, Flume and other platforms.
You can combine these modules according to your needs. For example, if you want to process a large graph and apply a clustering algorithm to it, you can use the representation provided by GraphX and then use MLlib to apply k-means to features derived from that representation, as in the sketch below.
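A minimal Scala sketch of that combination, assuming an existing SparkContext sc, a placeholder edge-list path, and vertex degree standing in for a real feature:

import org.apache.spark.graphx.GraphLoader
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Load a graph from an edge list (placeholder path) with GraphX.
val graph = GraphLoader.edgeListFile(sc, "hdfs:///path/to/edges.txt")

// Derive a simple per-vertex feature (its degree) from the graph...
val features = graph.degrees.map { case (_, deg) => Vectors.dense(deg.toDouble) }

// ...and cluster the vertices with MLlib's k-means (k = 3 and 20 iterations are arbitrary).
val model = KMeans.train(features.cache(), 3, 20)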

Can Spark and the ScalaNLP library Breeze be used together?

I'm developing a Scala-based extreme learning machine, in Apache Spark. My model has to be a Spark Estimator and use the Spark framework in order to fit into the machine learning pipeline. Does anyone know if Breeze can be used in tandem with Spark? All of my data is in Spark data frames and conceivably I could import it using Breeze, use Breeze DenseVectors as the data structure then convert to a DataFrame for the Estimator part. The advantage of Breeze is that it has a function pinv for the Moore-Penrose pseudo-inverse, which is an inverse for a non-square matrix. There is no equivalent function in the Spark MLlib, as far as I can see. I have no idea whether it's possible to convert Breeze tensors to Spark DataFrames so if anyone has experience of this it would be really useful. Thanks!
Breeze can be used with Spark. In fact, it is used internally by many MLlib functions, but the required conversions are not exposed publicly. You can add your own conversions and use Breeze to process individual records.
For example, for vectors you can find the conversion code in:
SparseVector.asBreeze
DenseVector.asBreeze
Vector.fromBreeze
For Matrices please see asBreeze / fromBreeze in Matrices.scala
It cannot, however, be used on distributed data structures. Breeze objects use low-level libraries that cannot be used for distributed processing. Therefore, DataFrame-to-Breeze conversions are possible only if you collect the data to the driver, and they are limited to scenarios where the data fits in driver memory.
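As a minimal Scala sketch of that workflow (the conversion helper is a hand-rolled assumption, since Spark's own asBreeze / fromBreeze are private):

import breeze.linalg.{DenseMatrix, DenseVector, pinv}
import org.apache.spark.ml.linalg.{Vector => MLVector}

// Hand-rolled per-record conversion (Spark's internal one is not public).
def toBreeze(v: MLVector): DenseVector[Double] = new DenseVector(v.toArray)

// Breeze provides pinv (Moore-Penrose pseudo-inverse); MLlib has no equivalent.
// This only works on in-core data, e.g. rows already collected to the driver.
val a = DenseMatrix((1.0, 2.0), (3.0, 4.0), (5.0, 6.0)) // 3x2, non-square
val aPinv = pinv(a)                                      // its 2x3 pseudo-inverse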
There are other libraries, like SystemML, which integrate with Spark and provide more comprehensive linear algebra routines on distributed objects.

General principles behind Spark MLlib parallelism

I'm new to Spark (and to cluster computing frameworks in general) and I'm wondering about the general principles followed by the parallel algorithms used for machine learning (MLlib). Are they essentially faster because Spark distributes training data over multiple nodes? If yes, I suppose that all nodes share the same set of parameters, right? And they have to combine (e.g. by summing) the intermediate calculations (e.g. the gradients) on a regular basis, am I wrong?
Secondly, suppose I want to fit my data with an ensemble of models (ex: 10). Wouldn't it be simpler in this particular context to run my good old machine-learning program independently on 10 machines instead of having to write complicated code (for me at least!) for training in a Spark cluster?
Corollary question: is Spark (or any other cluster computing framework) useful only for big-data applications where we cannot afford to train more than one model and where training time would be too long on a single machine?
You are correct about the general principle. A typical MLlib algorithm is an iterative procedure with a local phase and a data-exchange phase.
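As a rough illustration only (not MLlib's actual code), here is a Scala sketch of that pattern for gradient descent; points, localGradient, numFeatures, numIterations and stepSize are all assumed to be defined elsewhere:

import breeze.linalg.DenseVector

// Shared parameters live on the driver and are re-broadcast every iteration.
var w = DenseVector.zeros[Double](numFeatures)
for (_ <- 1 to numIterations) {
  val wB = sc.broadcast(w)
  val gradSum = points
    .map(p => localGradient(p, wB.value)) // local phase on each executor
    .reduce(_ + _)                        // data exchange: sum the partial gradients
  w = w - gradSum * stepSize              // driver-side parameter update
}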
MLlib algorithms are not necessarily faster. They try to solve two problems:
disk latency.
memory limitations on a single machine.
If you can process your data on a single node, that can be orders of magnitude faster than using ML / MLlib.
The last question is hard to answer but:
It is not complicated to train ensembles:
import numpy as np

def train_model(iter):
    # Collect this partition's records locally and fit a single-machine model.
    items = np.array(list(iter))
    model = ...  # fit your favourite local model on `items`
    return [model]  # mapPartitions expects an iterable back

models = rdd.mapPartitions(train_model).collect()
There are projects which already do that (https://github.com/databricks/spark-sklearn)

What is the difference between Apache Mahout and Apache Spark's MLlib?

Consider a MySQL products database with 10 million products for an e-commerce website.
I'm trying to set up a classification module to categorize products. I'm using Apache Sqoop to import data from MySQL to Hadoop.
I wanted to use Mahout on top of it as a machine learning framework and use one of its classification algorithms; then I ran into Spark, which ships with MLlib.
So what is the difference between the two frameworks?
Mainly, what are the advantages, drawbacks and limitations of each?
The main difference comes from the underlying frameworks: for Mahout it is Hadoop MapReduce, and for MLlib it is Spark. More specifically, it comes from the difference in per-job overhead.
If your ML algorithm maps to a single MR job, the main difference is only the startup overhead, which is dozens of seconds for Hadoop MR and, let's say, 1 second for Spark. So for model training it is not that important.
Things are different if your algorithm maps to many jobs.
In that case we pay the same overhead difference on every iteration, and it can be a game changer.
Let's assume we need 100 iterations, each needing 5 seconds of cluster CPU.
On Spark: it will take 100*5 + 100*1 seconds = 600 seconds.
On Hadoop MR (Mahout): it will take 100*5 + 100*30 = 3500 seconds.
At the same time, Hadoop MR is a much more mature framework than Spark, and if you have a lot of data and stability is paramount, I would consider Mahout a serious alternative.
MLlib is a loose collection of high-level algorithms that run on Spark. This is what Mahout used to be, except that the Mahout of old ran on Hadoop MapReduce. In 2014 Mahout announced it would no longer accept Hadoop MapReduce code and switched new development entirely to Spark (with other engines possibly in the offing, like H2O).
The most significant thing to come out of this is a Scala-based, generalized, distributed, optimized linear algebra engine and environment, including an interactive Scala shell. Perhaps the most important word is "generalized". Since it runs on Spark, anything available in MLlib can be used alongside Mahout's linear algebra engine.
If you need a general engine that will do a lot of what tools like R do, but on really big data, look at Mahout. If you need a specific algorithm, look at each to see what they have. For instance, k-means is in MLlib, but if you need to cluster A'A (a co-occurrence matrix used in recommenders) you'll need both, because MLlib has neither a matrix transpose nor A'A (in fact Mahout provides a thin, optimized A'A, so the transpose is optimized away).
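For instance, a hedged sketch of what that looks like in Mahout's Samsara Scala DSL (the import paths and the source of drmA are assumptions, not verified against a particular Mahout release):

import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._

// drmA is assumed to be a distributed row matrix (DRM) loaded elsewhere.
val ata = drmA.t %*% drmA   // A'A; the explicit transpose is optimized away by Mahout
val ataInCore = ata.collect // materialize in-core, only sensible if the result is small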
Mahout also includes some innovative recommender building blocks that offer things found in no other OSS.
Mahout still has its older Hadoop algorithms, but as fast compute engines like Spark become the norm, most people will invest there.
