Consider a MySQL products database with 10 million products for an e-commerce website.
I'm trying to set up a classification module to categorize products. I'm using Apache Sqoop to import data from MySQL to Hadoop.
I wanted to use Mahout on top of it as a machine learning framework and use one of its classification algorithms, and then I ran into Spark, which ships with MLlib.
So what is the difference between the two frameworks?
Mainly, what are the advantages, drawbacks, and limitations of each?
The main difference comes from the underlying frameworks: for Mahout it is Hadoop MapReduce, and for MLlib it is Spark. To be more specific, the difference is in per-job overhead.
If your ML algorithm maps to a single MR job, the main difference is only the startup overhead, which is dozens of seconds for Hadoop MR and, say, about 1 second for Spark. So for a one-off model training run it is not that important.
Things are different if your algorithm maps to many jobs. In that case we pay the same overhead on every iteration, and that can be a game changer.
Let's assume we need 100 iterations, each requiring 5 seconds of cluster CPU.
On Spark it will take 100*5 + 100*1 = 600 seconds.
On Hadoop MR (Mahout) it will take 100*5 + 100*30 = 3500 seconds.
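As a quick sanity check on that arithmetic (the per-job overheads are just the illustrative figures above, not measurements):

def total_runtime(iterations, compute_s, per_job_overhead_s):
    # Wall-clock time when every iteration is launched as a separate job.
    return iterations * (compute_s + per_job_overhead_s)

print(total_runtime(100, 5, 1))   # Spark-like overhead     -> 600 seconds
print(total_runtime(100, 5, 30))  # Hadoop MR-like overhead -> 3500 seconds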
At the same time, Hadoop MR is a much more mature framework than Spark, so if you have a lot of data and stability is paramount, I would consider Mahout a serious alternative.
Warning--major edit:
MLlib is a loose collection of high-level algorithms that run on Spark. This is what Mahout used to be, only the old Mahout ran on Hadoop MapReduce. In 2014 Mahout announced it would no longer accept Hadoop MapReduce code and switched new development entirely to Spark (with other engines possibly in the offing, like H2O).
The most significant thing to come out of this is a Scala-based, generalized, distributed, optimized linear algebra engine and environment, including an interactive Scala shell. Perhaps the most important word is "generalized". Since it runs on Spark, anything available in MLlib can be used alongside Mahout-Spark's linear algebra engine.
If you need a general engine that does much of what tools like R do, but on really big data, look at Mahout. If you need a specific algorithm, look at each project to see what it offers. For instance, k-means is in MLlib, but if you need to cluster A'A (a cooccurrence matrix used in recommenders) you'll need both, because MLlib doesn't have a matrix transpose or A'A (actually Mahout does a thin-optimized A'A, so the transpose is optimized out).
Mahout also includes some innovative recommender building blocks that offer things found in no other OSS.
Mahout still has its older Hadoop algorithms, but as fast compute engines like Spark become the norm, most people will invest there.
Related
What's the difference between Spark ML and SystemML?
There are problems that can be solved by both SystemML and Spark ML on the Apache Spark engine (on IBM's platform), and I want to know what the main difference is.
Apache Spark is a distributed, data-parallel framework with rich primitives such as map, reduce, join, filter, etc. Additionally, it powers a stack of “libraries” including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming.
Apache SystemML is a flexible, scalable machine learning (ML) system, enabling algorithm customization and automatic optimization. SystemML’s distinguishing characteristics are:
Algorithm customizability via R-like and Python-like languages.
Multiple execution modes, including Spark MLContext, Spark Batch, Hadoop Batch, Standalone, and JMLC.
Automatic optimization based on data and cluster characteristics to ensure both efficiency and scalability.
A more useful comparison is between Spark's MLlib and SystemML:
Like MLlib, SystemML can run on top of Apache Spark (in batch mode or through a programmatic API).
But unlike MLlib (which has a fixed runtime plan), SystemML has an optimizer that adapts the runtime plan based on the input data and cluster characteristics.
Both MLlib and SystemML can accept the input data as a Spark DataFrame.
MLlib's algorithms are written in Scala using Spark's primitives. At a high level, there are two kinds of MLlib users: (1) expert developers who implement their algorithms in Scala and have a deep understanding of Spark's core, and (2) non-expert data scientists who want to use MLlib as a black box and tweak the hyperparameters. Both kinds of users rely heavily on initial assumptions about the input data and cluster characteristics. If those assumptions do not hold for a given production use case, the user can get poor performance or even an OOM error.
SystemML's algorithms are implemented in a high-level (linear algebra friendly) language, and its optimizer dynamically compiles the runtime plan based on the input data and cluster characteristics. To simplify usage, SystemML comes with a bunch of pre-implemented algorithms along with MLlib-like wrappers.
Unlike MLlib, SystemML's algorithms can be used on other backends, such as Apache Hadoop, embedded in-memory, GPU, and possibly Apache Flink in the future.
Examples of machine learning systems with a cost-based optimizer (similar to SystemML): Mahout Samsara, Tupleware, Cumulon, Dmac and SimSQL.
Examples of machine learning libraries with a fixed plan (similar to MLlib): Mahout MR, MADlib, ORE, Revolution R and HP's Distributed R.
Examples of distributed systems with domain-specific languages (similar to Apache Spark): Apache Flink, REEF and GraphLab.
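To make the "R-like language plus optimizer" point concrete, here is a minimal sketch of running a few lines of DML through SystemML's Python MLContext API. The import path and method names are based on the systemml Python package and should be checked against your version; `df` is a hypothetical Spark DataFrame of features and `spark` an existing SparkSession.

from systemml import MLContext, dml

ml = MLContext(spark)  # assumes an existing SparkSession (or SparkContext)

# A tiny R-like DML script; SystemML compiles the runtime plan for it
# based on the size of X and the cluster characteristics.
script = dml("""
    means = colMeans(X)
""").input(X=df).output("means")

means = ml.execute(script).get("means")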
Checking Spark MLlib, I see that only a limited set of classification algorithms is included.
Is it possible to add the Weka jar file to Spark and use its classification algorithms on the Spark workers?
I know there is no parallelization benefit in doing this, but I am curious to know whether I could do so.
HamidReza,
You cannot really use the Weka jars directly to distribute your learning algorithms; their implementations are not designed for distributed systems.
Nevertheless, there is a specific project named distributedWekaSparkDev, but I haven't tried it yet.
You can learn more about it here: http://markahall.blogspot.com/2017/07/integrating-spark-mllib-into-weka.html
Is there a way to parallelize multiple ML algorithms in Spark? My use case is something like this:
A) Run multiple machine learning algorithms (Naive Bayes, ANN, Random Forest, etc.) in parallel.
1) Validate each algorithm using 10-fold cross-validation
B) Feed the output of step A) into a second-layer machine learning algorithm.
My question is:
Can we run the multiple machine learning algorithms of step A in parallel?
Can we do cross-validation in parallel? For example, run the 10 iterations of Naive Bayes training in parallel?
I was not able to find any way to run the different algorithms in parallel, and it seems that cross-validation cannot be done in parallel either.
I appreciate any suggestion to parallelize this use case.
I generally find that people get confused by the word "distributed". A programming language or an ML algorithm is not distributed by itself; it depends on the execution engine's collections (data structures). For example, Scala is not distributed, or more specifically, Scala's collections are not distributed. Big data tools like Spark make the collections distributed by wrapping them inside their own data structures, namely RDDs, DataFrames, LabeledPoints and Vectors. These structures make the computation parallel, which in turn depends on the partitions.
To answer your question: yes, we can run machine learning in parallel, because the data that any machine learning algorithm runs on is distributed among the nodes of an n-node cluster.
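A hedged sketch of how this can look in practice: Spark runs jobs submitted from separate driver threads concurrently (subject to available cluster resources), and since Spark 2.3 CrossValidator accepts a parallelism parameter that fits folds and parameter combinations in parallel. Here train_df is an assumed DataFrame with the usual "label" and "features" columns, and the perceptron layer sizes are placeholders.

from concurrent.futures import ThreadPoolExecutor

from pyspark.ml.classification import (NaiveBayes, RandomForestClassifier,
                                       MultilayerPerceptronClassifier)
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

def cross_validate(estimator):
    # 10-fold cross-validation for one estimator; folds are fit in parallel.
    cv = CrossValidator(estimator=estimator,
                        estimatorParamMaps=ParamGridBuilder().build(),
                        evaluator=MulticlassClassificationEvaluator(),
                        numFolds=10,
                        parallelism=4)
    return cv.fit(train_df)

estimators = [NaiveBayes(),
              RandomForestClassifier(),
              MultilayerPerceptronClassifier(layers=[10, 8, 2])]  # placeholder layers

# Step A: train the first-layer models concurrently from the driver.
with ThreadPoolExecutor(max_workers=len(estimators)) as pool:
    first_layer_models = list(pool.map(cross_validate, estimators))
# Step B would then build its training set from these models' predictions.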
I'm new to Spark (and to cluster computing frameworks) and I'm wondering about the general principles behind the parallel machine learning algorithms in MLlib. Are they essentially faster because Spark distributes the training data over multiple nodes? If so, I suppose all nodes share the same set of parameters, right? And they have to combine (e.g. sum) the intermediate results (e.g. the gradients) on a regular basis, am I wrong?
Secondly, suppose I want to fit my data with an ensemble of models (say, 10). Wouldn't it be simpler in this particular context to run my good old machine-learning program independently on 10 machines, instead of having to write complicated code (for me, at least!) to train on a Spark cluster?
Corollary question: is Spark (or another cluster computing framework) useful only for big data applications where we cannot afford to train more than one model and where training time on a single machine would be too long?
You are correct about the general principle. A typical MLlib algorithm is an iterative procedure with a local phase and a data-exchange phase.
MLlib algorithms are not necessarily faster. They try to solve two problems:
disk latency.
memory limitations on a single machine.
If you can process the data on a single node, that can be orders of magnitude faster than using ML / MLlib.
The last question is hard to answer, but:
It is not complicated to train ensembles:
import numpy as np

def train_model(partition):
    # Fit a single-node model on the data of one partition.
    items = np.array(list(partition))
    model = ...
    return [model]  # mapPartitions expects an iterable back

models = rdd.mapPartitions(train_model).collect()  # one model per partition
There are projects which already do that (https://github.com/databricks/spark-sklearn).
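For the scikit-learn route specifically, spark-sklearn advertises a drop-in replacement for scikit-learn's grid search that fans the parameter combinations out over the cluster. A rough sketch based on that project's README (the constructor signature may differ between versions; X and y are assumed in-memory numpy arrays, sc an existing SparkContext):

from sklearn.ensemble import RandomForestClassifier
from spark_sklearn import GridSearchCV  # from the databricks/spark-sklearn project

param_grid = {"n_estimators": [10, 50, 100], "max_depth": [3, 5, None]}

# Each parameter combination is fitted as a separate Spark task,
# while the data itself stays local, as in ordinary scikit-learn.
search = GridSearchCV(sc, RandomForestClassifier(), param_grid)
search.fit(X, y)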
Dear Apache Spark community:
I've been reading Spark's documentation for several weeks. I read about Logistic Regression in MLlib and realized that Spark uses two kinds of optimization routines (SGD and L-BFGS).
But currently I'm reading the documentation of LogisticRegression in ML, and I couldn't see explicitly what kind of optimization routine the developers used. How can I request this information?
With many thanks.
The key point is the API that each of them uses.
MLlib focuses on the RDD API, the core of Spark, but some operations such as sums, averages and other simple functions take more time than the equivalent DataFrame operations.
ML is a library that works with DataFrames, and DataFrames come with query optimization for basic functions like sums and similar operations.
You can check this blog post; this is one of the reasons why ML should be faster than MLlib.
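A small sketch of that contrast in PySpark (both APIs exist; `points` is assumed to be an RDD of LabeledPoint and `df` a DataFrame with "label" and "features" columns):

# Older RDD-based API: the optimizer is spelled out in the class name.
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
mllib_model = LogisticRegressionWithLBFGS.train(points, iterations=100)

# Newer DataFrame-based API (spark.ml): works on DataFrames and therefore
# benefits from Catalyst query optimization for the surrounding processing.
from pyspark.ml.classification import LogisticRegression
ml_model = LogisticRegression(maxIter=100, regParam=0.01).fit(df)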