Apache Spark: Parallelization of Multiple Machine Learning Algorithms - apache-spark

Is there a way to parallelize multiple ML algorithms in Spark? My use case is something like this:
A) Run multiple machine learning algorithms (Naive Bayes, ANN, Random Forest, etc.) in parallel.
1) Validate each algorithm using 10-fold cross-validation.
B) Feed the output of step A) into a second-layer machine learning algorithm.
My questions are:
Can we run the multiple machine learning algorithms of step A in parallel?
Can we do cross-validation in parallel? For example, run the 10 iterations of Naive Bayes training in parallel?
I was not able to find any way to run the different algorithms in parallel, and it seems cross-validation cannot be done in parallel either.
I appreciate any suggestions for parallelizing this use case.

I generally find that people get confused by the word "distributed". No programming language or ML algorithm is distributed in itself; it depends on the execution engine's collections (data structures). For example, Scala is not distributed, or more specifically, Scala's collections are not distributed. Big data tools like Spark make the collections distributed by wrapping them inside their own data structures, and yes, I am talking about RDDs, DataFrames, LabeledPoints and Vectors. These structures make the computation parallel, which in turn depends on the partitions.
To answer your question: yes, we can run machine learning in parallel, because the data on which any machine learning algorithm will run is distributed among the nodes of an n-node cluster.
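Below is a minimal sketch of both ideas, assuming Spark 2.3+ (where CrossValidator gained a parallelism parameter) and an existing DataFrame train_df with the usual "features"/"label" columns; the thread pool simply submits the cross-validations of step A as concurrent Spark jobs:

from concurrent.futures import ThreadPoolExecutor
from pyspark.ml.classification import NaiveBayes, RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

def cross_validate(estimator):
    # parallelism=4 lets Spark evaluate several folds/param sets at once
    cv = CrossValidator(estimator=estimator,
                        estimatorParamMaps=ParamGridBuilder().build(),
                        evaluator=MulticlassClassificationEvaluator(),
                        numFolds=10,
                        parallelism=4)
    return cv.fit(train_df)

# Spark's scheduler accepts jobs from several driver threads, so the
# first-layer models of step A can be trained at the same time.
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(cross_validate, est)
               for est in (NaiveBayes(), RandomForestClassifier())]
    first_layer_models = [f.result() for f in futures]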

Related

Is Spark good for automatically running statistical analysis script in many nodes for a speedup?

I have a Python script that runs statistical analysis and trained deep learning models on input data. The data size is fairly small (~5 MB), but the script is slow due to the complexity of the analysis. I wonder if it would be possible to use Spark to run my script on different nodes of a cluster so that I can gain a speedup. Basically, I want to divide the input data into many subsets and run the analysis script on them in parallel. Is Spark a good tool for this purpose? Thank you in advance!
As long as you integrate your deep learning model into your pyspark pipeline and use partitioning, you can expect a speedup in the runtime. Without code, it's hard to make specific recommendations, but this article is a good place to start.
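A hedged sketch of that pattern, assuming an existing DataFrame input_df whose subsets can be analyzed independently; analyze_subset is a hypothetical stand-in for the actual analysis script:

import pandas as pd

def analyze_partition(rows):
    # each partition is one subset, small enough to fit in a worker's memory
    subset = pd.DataFrame([row.asDict() for row in rows])
    yield analyze_subset(subset)        # hypothetical per-subset analysis

# repartition so the subsets are spread over the cluster's workers
results = (input_df.rdd
           .repartition(16)
           .mapPartitions(analyze_partition)
           .collect())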

Efficient implementation of SOM (Self organizing map) on Pyspark

I am struggling with the implementation of a performant version of a SOM Batch algorithm on Spark / Pyspark for a huge dataset with > 100 features.
I have the feeling that I can either use RDDs, where I can/have to specify the parallelization on my own, or I can use DataFrames, which should be more performant, but I see no way to use something like a local accumulation variable for each worker when using DataFrames.
Ideas:
Using accumulators: parallelize the calculations by creating a UDF which takes the observations as input, calculates the impacts on the net and sends the impacts to an accumulator on the driver. (I have implemented this version already, but it seems rather slow; I think the accumulator updates take too long.)
Store the results in a new column of the DataFrame and then sum them together at the end. (I would have to store a whole neural net in each row, e.g. 20*20*130, though.) Would Spark's optimizer realize that it does not need to save each net but only to sum them together?
Create a custom parallelized algorithm using RDDs, similar to this one: https://machinelearningnepal.com/2018/01/22/apache-spark-implementation-of-som-batch-algorithm/ (but with more performant calculation algorithms). However, I would have to use some kind of loop to iterate over each row and update the net, which sounds rather inefficient.
Any thoughts on the different options? Is there an even better option?
Or are none of these ideas that good, and should I just preselect a maximum-variety subset of my dataset and train a SOM locally on that?
Thanks!
This is exactly what I did last year, so I might be in a good position to give you an answer.
First, here is my Spark implementation of the batch SOM algorithm (it is written in Scala, but most things will be similar in Pyspark).
I needed this algorithm for a project, and every implementation I found had at least one of these two problems or limitations:
they did not really implement the batch SOM algorithm, but used a map averaging method that gave me strange results (abnormal symmetries in the output map)
they did not use the DataFrame API (pure RDD API) and were not in the Spark ML/MLlib spirit, i.e. with a simple fit()/transform() API operating over DataFrames.
So I went on to code it myself: the batch SOM algorithm in Spark ML style. The first thing I did was to look at how k-means was implemented in Spark ML, because, as you know, the batch SOM is very similar to the k-means algorithm. I was actually able to re-use a large portion of the Spark ML k-means code, but I had to modify the core algorithm and the hyperparameters.
I can quickly summarize how the model is built (a rough PySpark sketch of this structure follows the list):
A SOMParams class, containing the SOM hyperparameters (size, training parameters, etc.)
A SOM class, which inherits from Spark's Estimator and contains the training algorithm. In particular, it contains a fit() method that operates on an input DataFrame, where the features are stored as a spark.ml.linalg.Vector in a single column. fit() then selects this column and unpacks the DataFrame to obtain the underlying RDD[Vector] of features, and calls the run() method on it. This is where all the computations happen, and as you guessed, it uses RDDs, accumulators and broadcast variables. Finally, the fit() method returns a SOMModel object.
SOMModel is a trained SOM model, and inherits from spark's Transformer/Model. It contains the map prototypes (center vectors), and contains a transform() method that can operate on DataFrames by taking an input feature column, and adding a new column with the predictions (projection on the map). This is done by a prediction UDF.
There is also SOMTrainingSummary that collects stuff such as the objective function.
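A rough PySpark analogue of that structure, as a sketch only (the original is Scala, and names like run_batch_som are illustrative placeholders rather than taken from the linked repository):

from pyspark.ml import Estimator, Model
from pyspark.ml.param.shared import HasFeaturesCol, HasPredictionCol

class SOM(Estimator, HasFeaturesCol):
    def _fit(self, dataset):
        # unpack the feature column into an RDD of vectors and run the
        # batch-SOM training loop (accumulators + broadcast prototypes)
        vectors = dataset.select(self.getFeaturesCol()).rdd.map(lambda r: r[0])
        prototypes = run_batch_som(vectors)   # hypothetical training routine
        return SOMModel(prototypes)

class SOMModel(Model, HasFeaturesCol, HasPredictionCol):
    def __init__(self, prototypes):
        super().__init__()
        self.prototypes = prototypes

    def _transform(self, dataset):
        # a prediction UDF would add a column with each row's best-matching unit
        ...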
Here are the take-aways:
There is not really an opposition between RDDs and DataFrames (or rather Datasets, but the difference between those two is of no real importance here). They are just used in different contexts. In fact, a DataFrame can be seen as an RDD specialized for manipulating structured data organized in columns (such as relational tables), allowing SQL-like operations and optimization of the execution plan (the Catalyst optimizer).
For structured data and select/filter/aggregation operations, DO USE DataFrames, always.
...but for more complex tasks such as a machine learning algorithm, you NEED to come back to the RDD API and distribute your computations yourself, using map/mapPartitions/foreach/reduce/reduceByKey and so on (see the sketch below). Look at how things are done in MLlib: it's only a nice wrapper around RDD manipulations!
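A minimal sketch of that "distribute it yourself" pattern, assuming an existing SparkContext sc and an RDD of 1-D numpy feature vectors called features_rdd; each partition accumulates local sums against broadcast prototypes and the driver merges them (a simplified update, with the neighborhood weighting omitted):

import numpy as np

prototypes_bc = sc.broadcast(np.random.rand(20 * 20, 130))  # illustrative map size

def partial_sums(rows):
    protos = prototypes_bc.value
    num = np.zeros_like(protos)
    den = np.zeros(protos.shape[0])
    for x in rows:
        bmu = np.argmin(((protos - x) ** 2).sum(axis=1))     # best-matching unit
        num[bmu] += x
        den[bmu] += 1
    yield num, den

num, den = (features_rdd
            .mapPartitions(partial_sums)
            .reduce(lambda a, b: (a[0] + b[0], a[1] + b[1])))
new_prototypes = num / np.maximum(den, 1)[:, None]           # one simplified batch step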
I hope this answers your question. Concerning performance: as you asked for an efficient implementation, I have not run any benchmarks yet, but I use it at work and it crunches 500k/1M-row datasets in a couple of minutes on the production cluster.

PySpark with scikit-learn

I have seen that we can use scikit-learn with PySpark to work on a single partition on a single worker.
But what if we want to work on a training dataset that is distributed, and say the regression algorithm should consider the entire dataset? Since scikit-learn is not integrated with RDDs, I assume it does not allow running the algorithm on the entire dataset, but only on one particular partition. Please correct me if I'm wrong.
And how good is spark-sklearn at solving this problem?
As described in the documentation, spark-sklearn does address your requirements:
train and evaluate multiple scikit-learn models in parallel. It is a distributed analog to the multicore implementation included by default in scikit-learn.
convert Spark's Dataframes seamlessly into numpy ndarrays or sparse matrices.
So, to specifically answer your questions:
But what if we want to work on training dataset that is distributed
and say the regression algorithm should concern with entire dataset.
Since scikit learn is not integrated with RDD I assume it doesn't allow to run the algorithm on the entire dataset on that particular partition
In spark-sklearn, Spark is used as a replacement for the joblib library as a multithreading framework. So, going from an execution on a single machine to an execution on multiple machines is seamlessly handled by Spark for you. In other terms, as stated in the Auto-scaling scikit-learn with Spark article:
no change is required in the code between the single-machine case and the cluster case.
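A hedged sketch of that workflow, assuming spark-sklearn's drop-in GridSearchCV (which takes the SparkContext as its first argument) and local numpy arrays X and y on the driver; the grid search keeps scikit-learn's API, but each parameter/fold combination is trained on a Spark executor instead of a local joblib worker:

from sklearn.ensemble import RandomForestClassifier
from spark_sklearn import GridSearchCV   # drop-in for sklearn's GridSearchCV

param_grid = {"n_estimators": [50, 100, 200], "max_depth": [4, 8, None]}
gs = GridSearchCV(sc, RandomForestClassifier(), param_grid)
gs.fit(X, y)                 # X, y: plain numpy arrays on the driver
print(gs.best_params_)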

General principles behind Spark MLlib parallelism

I'm new to Spark (and to cluster computing frameworks) and I'm wondering about the general principles followed by the parallel algorithms used for machine learning (MLlib). Are they essentially faster because Spark distributes the training data over multiple nodes? If so, I suppose that all nodes share the same set of parameters, right? And that they have to combine (e.g. by summing) the intermediate calculations (e.g. the gradients) on a regular basis, am I wrong?
Secondly, suppose I want to fit my data with an ensemble of models (e.g. 10). Wouldn't it be simpler, in this particular context, to run my good old machine-learning program independently on 10 machines instead of having to write complicated code (for me at least!) for training on a Spark cluster?
Corollary question: is Spark (or another cluster computing framework) useful only for big data applications, for which we could not afford to train more than one model and for which the training time on a single machine would be too long?
You are correct about the general principle. A typical MLlib algorithm is an iterative procedure with a local phase and a data-exchange phase.
MLlib algorithms are not necessarily faster. They try to solve two problems:
disk latency.
memory limitations on a single machine.
If you can process data on a single node this can be orders of magnitude faster than using ML / MLlib.
The last question is hard to answer but:
It is not complicated to train ensembles:
import numpy as np

def train_model(iterator):
    # materialize this partition's rows as a local numpy array,
    # then fit any single-machine model (e.g. scikit-learn) on it
    items = np.array(list(iterator))
    model = ...
    return [model]          # mapPartitions expects an iterable

models = rdd.mapPartitions(train_model).collect()
There are projects which already do that (https://github.com/databricks/spark-sklearn)

What is the difference between Apache Mahout and Apache Spark's MLlib?

Considering a MySQL products database with 10 millions products for an e-commerce website.
I'm trying to set up a classification module to categorize products. I'm using Apache Sqoop to import data from MySQL to Hadoop.
I wanted to use Mahout on top of it as a machine learning framework, to use one of its classification algorithms, and then I ran into Spark, which ships with MLlib.
So what is the difference between the two frameworks?
Mainly, what are the advantages, drawbacks and limitations of each?
The main difference comes from the underlying frameworks. In the case of Mahout it is Hadoop MapReduce, and in the case of MLlib it is Spark. To be more specific, the difference comes from the per-job overhead.
If your ML algorithm maps to a single MR job, the main difference will be only the startup overhead, which is dozens of seconds for Hadoop MR and, say, 1 second for Spark. So for model training it is not that important.
Things will be different if your algorithm is mapped to many jobs.
In this case we will have the same overhead difference per iteration, and it can be a game changer.
Let's assume that we need 100 iterations, each needing 5 seconds of cluster CPU.
On Spark: it will take 100*5 + 100*1 seconds = 600 seconds.
On Hadoop MR (Mahout): it will take 100*5 + 100*30 = 3500 seconds.
At the same time, Hadoop MR is a much more mature framework than Spark, and if you have a lot of data and stability is paramount, I would consider Mahout as a serious alternative.
MLlib is a loose collection of high-level algorithms that run on Spark. This is what Mahout used to be, only Mahout of old ran on Hadoop MapReduce. In 2014 Mahout announced it would no longer accept Hadoop MapReduce code and switched new development completely to Spark (with other engines possibly in the offing, like H2O).
The most significant thing to come out of this is a Scala-based, generalized, distributed, optimized linear algebra engine and environment, including an interactive Scala shell. Perhaps the most important word is "generalized". Since it runs on Spark, anything available in MLlib can be used with the linear algebra engine of Mahout-Spark.
If you need a general engine that will do a lot of what tools like R do, but on really big data, look at Mahout. If you need a specific algorithm, look at each to see what they have. For instance, k-means runs in MLlib, but if you need to cluster A'A (a co-occurrence matrix used in recommenders) you'll need them both, because MLlib doesn't have a matrix transpose or A'A (actually, Mahout does a thin-optimized A'A, so the transpose is optimized out).
Mahout also includes some innovative recommender building blocks that offer things found in no other OSS.
Mahout still has its older Hadoop algorithms but as fast compute engines like Spark become the norm most people will invest there.
