As far as I can see, Spark MLlib includes only a limited set of classification algorithms.
Is it possible to add the Weka jar file to Spark and use its classification algorithms on Spark workers?
I know there is no parallelization benefit in doing this, but I am curious to know whether it is possible.
HamidReza,
You cannot really use the Weka jars directly to distribute your learning algorithms; their implementations are not designed for distributed systems.
Nevertheless, there is a dedicated project named distributedWekaSparkDev, though I haven't tried it yet.
You can learn more about it here: http://markahall.blogspot.com/2017/07/integrating-spark-mllib-into-weka.html
I want to train some models using fastText, and since it doesn't use Spark, the training will run on my driver. The number of training jobs that will run simultaneously is very large, and so is the size of the data. Is there a way to run the training on different workers, or to distribute it across workers?
Is this the best approach, or am I better off using a large single-node cluster?
FYI, I am using Databricks, so Databricks-specific solutions are also okay.
You can use Databricks multi-node clusters to run training even for libraries that are effectively single-node, such as scikit-learn. This is typically done with the Hyperopt library, which is bundled with the ML runtimes. You will need to define an objective function, but its implementation depends on the model. Look into this example that shows how to run different algorithms from scikit-learn.
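For illustration, here is a minimal sketch of that pattern (not Databricks-specific), using scikit-learn as a stand-in for fastText; the search space and parameter values are made up for the example. With Hyperopt's SparkTrials, each trial runs as a Spark task on a worker:

    from hyperopt import SparkTrials, STATUS_OK, fmin, hp, tpe
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)

    def objective(params):
        # Train and score one candidate model; this function executes
        # inside a Spark task on a worker.
        clf = RandomForestClassifier(
            n_estimators=int(params["n_estimators"]),
            max_depth=int(params["max_depth"]),
        )
        score = cross_val_score(clf, X, y, cv=3).mean()
        return {"loss": -score, "status": STATUS_OK}  # fmin minimizes loss

    search_space = {
        "n_estimators": hp.quniform("n_estimators", 50, 500, 50),
        "max_depth": hp.quniform("max_depth", 2, 10, 1),
    }

    # SparkTrials distributes the trials across the cluster instead of
    # running them one after another on the driver.
    best = fmin(fn=objective, space=search_space, algo=tpe.suggest,
                max_evals=20, trials=SparkTrials(parallelism=4))
    print(best)

Each of your fastText training runs would become one evaluation of the objective function, so the jobs spread across workers instead of queuing on the driver.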
What's the difference between Spark ML and SystemML?
The same problem can be solved with both SystemML and Spark ML on the Apache Spark engine on IBM; I want to know the main difference between the two.
Apache Spark is a distributed, data-parallel framework with rich primitives such as map, reduce, join, filter, etc. Additionally, it powers a stack of “libraries” including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming.
Apache SystemML is a flexible, scalable machine learning (ML) system, enabling algorithm customization and automatic optimization. SystemML’s distinguishing characteristics are:
Algorithm customizability via R-like and Python-like languages (a minimal sketch follows this list).
Multiple execution modes, including Spark MLContext, Spark Batch, Hadoop Batch, Standalone, and JMLC.
Automatic optimization based on data and cluster characteristics to ensure both efficiency and scalability.
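To make the first point concrete, here is a minimal sketch of running a script written in DML (SystemML's R-like language) through MLContext from PySpark; the toy script is made up, and the API details may differ between SystemML versions:

    from pyspark.sql import SparkSession
    from systemml import MLContext, dml

    spark = SparkSession.builder.appName("systemml-demo").getOrCreate()
    ml = MLContext(spark)

    # Toy DML script: random matrix, Gram matrix, and a scalar summary.
    # SystemML's optimizer decides how to execute it (locally or on Spark)
    # based on the data and cluster characteristics.
    script = dml("""
        X = rand(rows=1000, cols=100)
        G = t(X) %*% X
        s = sum(G)
    """).output("s")

    result = ml.execute(script)
    print(result.get("s"))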
A more useful comparison would be between Spark's MLLib and SystemML:
Like MLLib, SystemML can run on top of Apache Spark (batch mode or programmatic API).
But unlike MLLib (which has a fixed runtime plan), SystemML has an optimizer that adapts the runtime plan based on the input data and cluster characteristics.
Both MLLib and SystemML can accept the input data as Spark’s DataFrame.
MLLib’s algorithms are written in Scala using Spark’s primitives. At a high level, there are two kinds of MLLib users: (1) expert developers who implement their algorithms in Scala and have a deep understanding of Spark’s core, and (2) non-expert data scientists who want to use MLLib as a black box and tweak the hyperparameters. Both groups rely heavily on initial assumptions about the input data and cluster characteristics. If those assumptions are not valid for a given production use case, the user can get poor performance or even an OOM error.
SystemML’s algorithms are implemented in a high-level (linear-algebra-friendly) language, and its optimizer dynamically compiles the runtime plan based on the input data and cluster characteristics. To simplify usage, SystemML comes with a set of pre-implemented algorithms along with MLLib-like wrappers.
Unlike MLLib, SystemML’s algorithms can run on other backends, such as Apache Hadoop, embedded in-memory, GPU, and perhaps Apache Flink in the future.
Examples of machine learning systems with a cost-based optimizer (similar to SystemML): Mahout Samsara, Tupleware, Cumulon, Dmac and SimSQL.
Examples of machine learning libraries with a fixed plan (similar to MLLib): Mahout MR, MADlib, ORE, Revolution R and HP’s Distributed R.
Examples of distributed systems with domain specific languages (similar to Apache Spark): Apache Flink, REEF and GraphLab.
I'm developing a Scala-based extreme learning machine in Apache Spark. My model has to be a Spark Estimator and use the Spark framework in order to fit into the machine learning pipeline. Does anyone know if Breeze can be used in tandem with Spark? All of my data is in Spark DataFrames; conceivably I could import it using Breeze, use Breeze DenseVectors as the data structure, and then convert to a DataFrame for the Estimator part. The advantage of Breeze is that it has a function pinv for the Moore-Penrose pseudo-inverse, which is an inverse for a non-square matrix. There is no equivalent function in Spark MLlib, as far as I can see. I have no idea whether it's possible to convert Breeze tensors to Spark DataFrames, so if anyone has experience of this it would be really useful. Thanks!
Breeze can be used with Spark. In fact, it is used internally by many MLlib functions, but the required conversions are not exposed as public. You can add your own conversions and use Breeze to process individual records.
For example, for Vectors you can find the conversion code in:
SparseVector.asBreeze
DenseVector.asBreeze
Vector.fromBreeze
For Matrices please see asBreeze / fromBreeze in Matrices.scala
It cannot, however, be used on distributed data structures. Breeze objects use low-level libraries that cannot be used for distributed processing. Therefore, DataFrame-to-Breeze conversions are possible only if you collect the data to the driver, and they are limited to scenarios where the data fits in driver memory.
There are other libraries, like SystemML, that integrate with Spark and provide more comprehensive linear algebra routines on distributed objects.
I have seen that scikit-learn libraries can be used with PySpark to work on a partition on a single worker.
But what if we want to train on a dataset that is distributed, and the regression algorithm should be concerned with the entire dataset? Since scikit-learn is not integrated with RDDs, I assume it doesn't allow running the algorithm on the entire dataset, but only on a particular partition. Please correct me if I'm wrong.
And how good is spark-sklearn at solving this problem?
As described in the documentation, spark-sklearn does answer your requirements; it can:
train and evaluate multiple scikit-learn models in parallel. It is a distributed analog to the multicore implementation included by default
in scikit-learn.
convert Spark's Dataframes seamlessly into numpy ndarrays or sparse matrices.
So, to specifically answer your questions:
But what if we want to train on a dataset that is distributed,
and the regression algorithm should be concerned with the entire dataset?
Since scikit-learn is not integrated with RDDs, I assume it doesn't allow running the algorithm on the entire dataset, but only on a particular partition.
In spark-sklearn, Spark is used as a replacement for the joblib library as a multiprocessing framework. So, going from an execution on a single machine to an execution on multiple machines is handled seamlessly by Spark for you. In other words, as stated in the Auto-scaling scikit-learn with Spark article:
no change is required in the code between the single-machine case and the cluster case.
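As a concrete illustration, here is a minimal sketch along the lines of the project's documentation; the only change relative to plain scikit-learn is importing GridSearchCV from spark_sklearn and handing it the SparkContext:

    from sklearn import datasets, svm
    from pyspark import SparkContext
    from spark_sklearn import GridSearchCV

    sc = SparkContext.getOrCreate()

    digits = datasets.load_digits()
    param_grid = {"kernel": ["linear", "rbf"], "C": [1, 10]}

    # Same interface as scikit-learn's GridSearchCV, but each parameter
    # combination is fitted as a Spark task on a worker.
    clf = GridSearchCV(sc, svm.SVC(), param_grid)
    clf.fit(digits.data, digits.target)
    print(clf.best_params_)

Note that each individual model is still trained on a single machine; what gets distributed is the set of models being trained and evaluated.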
Consider a MySQL products database with 10 million products for an e-commerce website.
I'm trying to set up a classification module to categorize products. I'm using Apache Sqoop to import data from MySQL to Hadoop.
I wanted to use Mahout on top of it as a machine learning framework, using one of its classification algorithms, and then I ran into Spark, which comes with MLlib.
So what is the difference between the two frameworks?
Mainly, what are the advantages, drawbacks, and limitations of each?
The main difference comes from the underlying frameworks: Hadoop MapReduce in the case of Mahout, and Spark in the case of MLlib. More specifically, it comes from the difference in per-job overhead.
If your ML algorithm maps to a single MR job, the main difference will be only the startup overhead, which is dozens of seconds for Hadoop MR and, say, 1 second for Spark. So for one round of model training it is not that important.
Things are different if your algorithm maps to many jobs.
In that case we pay the same overhead difference on every iteration, and it can be a game changer.
Let's assume we need 100 iterations, each needing 5 seconds of cluster CPU, with a per-job overhead of 1 second on Spark and 30 seconds on Hadoop MR:
On Spark: it will take 100*5 + 100*1 = 600 seconds.
On Hadoop MR (Mahout): it will take 100*5 + 100*30 = 3500 seconds.
At the same time, Hadoop MR is a much more mature framework than Spark, and if you have a lot of data and stability is paramount, I would consider Mahout a serious alternative.
Warning--major edit:
MLlib is a loose collection of high-level algorithms that run on Spark. This is what Mahout used to be, except that the Mahout of old ran on Hadoop MapReduce. In 2014 Mahout announced it would no longer accept Hadoop MapReduce code and switched new development entirely to Spark (with other engines possibly in the offing, like H2O).
The most significant thing to come out of this is a Scala-based, generalized, distributed, optimized linear algebra engine and environment, including an interactive Scala shell. Perhaps the most important word is "generalized": since it runs on Spark, anything available in MLlib can be used together with Mahout's linear algebra engine.
If you need a general engine that will do a lot of what tools like R do, but on really big data, look at Mahout. If you need a specific algorithm, look at each to see what they have. For instance, k-means runs in MLlib, but if you need to cluster A'A (a co-occurrence matrix used in recommenders) you'll need both, because MLlib doesn't have a matrix transpose or A'A (actually, Mahout does a thin-optimized A'A, so the transpose is optimized out).
Mahout also includes some innovative recommender building blocks that offer things found in no other OSS.
Mahout still has its older Hadoop algorithms, but as fast compute engines like Spark become the norm, most people will invest there.