Convolutional Neural Network in Spark - apache-spark

I'm trying to implement a Convolutional Neural Network algorithm on Spark and I wanted to ask two questions before moving forward.
I need to implement my code such that, it is highly integrated with Spark and also follows the principles of machine learning algorithms in Spark. I found that Spark ML is an established ground for machine learning codes and it has a specific foundation, which all written algorithms are following. Also, the implemented algorithms are offloading their heavy mathematical operations to third party libraries such as BLAS, to do calculations fast.
Now I wanted to ask:
1) Is ML the right place to start? By following the ML structure, does my code going to be highly integrable with the rest of the spark ML ecosystem?
2) Am I right about the bottom of the ML codes, where they offload the processing into another mathematical library? Does it mean I can decide to change that layer to do the heavy processings in a customized fashion?
Would appreciate any suggestions.

Related

Can Azure Machine Learning be applied for manipulations?

I am going thru the samples for Azure Machine Learning. It looks like the examples are leading me to the point that ML is being used to classification problems like ranking, classifying or detecting the category by model trained from inferred-sample-data.
Now that I am wondering if ML can be trained to computational problems like Multiplication, Division, other series problems,..? Does this problem fit in ML scope?
MULTIPLICATION DATASET:
Num01,Num02,Result
1,1,1
1,2,2
1,3,3
1,4,4
1,5,5
1,6,6
1,7,7
1,8,8
1,9,9
1,10,10
1,11,11
1,12,12
1,13,13
1,14,14
2,1,2
2,2,4
2,3,6
2,4,8
2,5,10
2,6,12
2,7,14
2,8,16
2,9,18
2,10,20
2,11,22
2,12,24
2,13,26
2,14,28
3,1,3
3,2,6
SCORING DATASET:
Num01,Num02
1,5
3,1
2,16
3,15
1,32
It seems like you are looking for regression, which is supportd by almost every machine learning library, including Azure's services. In laymans terms, the goal of regression is to approximate an unknown function that maps data X to a continuous value y.
This can be any function, indeed including multiplication or division. However, do note that these cases are usually way too simple to solve with machine learning. Most machine learning algorithms (except maybe linear regression)do a lot more internal computations and will as a result be slower than a native implementation on your device.
As an extra point of clarification, most of the actual machine learning (ML) in Azure ML is done by great open source libraries such as sk-learn or keras. Azure mainly provides compute power and higher-level management tools, such as experiment tracking and efficient hyper-parameter-tuning.
If you are just getting started with ML and want to go more in-depth, then this extra functionality might be overkill/confusing. So I would advise to start with focusing on one of the packages that I described above. Additionally you would need to combine that with some more formal training, which will explain most of the important concepts to you.

PyMC3/Edward/Pyro on Spark?

Has anyone tried using a python probabilistic programming library with Spark? Or does anyone have a good idea of what it would take?
I have a feeling Edward would be simplest because there are already tools connecting Tensorflow and Spark, but still hazy about what low-level code changes would be required.
I know distributed MCMC is still an area of active research (see MC-Stan on Spark?), so is this even reasonable to implement? Thanks!
You can use Tensorflow connectors with Edward since it is based on Tensorflow, one of the main drawbacks of MCMC is very computational intensive, you may try Variational inference for your Bayesian models it approximates the target distribution. (this also applies to Pyro and PyMC3 I believe), you can also work with Tensorflow distributed tensorflow distributed
I also recommend you to use/try a library called "Dask
"https://dask.pydata.org/en/latest/Dask, you can scale your model from your workstation to a cluster it also has Tensorflow connectors.
Hope this helps
I've seen people run Pyro+PyTorch in PySpark, but the use case was CPU-only and did not involve distributed training.

PySpark with scikit-learn

I have seen around that we could use scikit-learn libraries with pyspark for working on a partition on a single worker.
But what if we want to work on training dataset that is distributed and say the regression algorithm should concern with entire dataset. Since scikit learn is not integrated with RDD I assume it doesn't allow to run the algorithm on the entire dataset but only on that particular partition. Please correct me if I'm wrong..
And how good is spark-sklearn in solving this problem
As described in the documentation, spark-sklearn does answer your requirements
train and evaluate multiple scikit-learn models in parallel. It is a distributed analog to the multicore implementation included by default
in scikit-learn.
convert Spark's Dataframes seamlessly into numpy ndarrays or sparse matrices.
so, to specifically answer your questions:
But what if we want to work on training dataset that is distributed
and say the regression algorithm should concern with entire dataset.
Since scikit learn is not integrated with RDD I assume it doesn't allow to run the algorithm on the entire dataset on that particular partition
In spark-sklearn, spark is used as the replacement to the joblib library as a multithreading framework. So, going from an execution on a single machine to an excution on mutliple machines is seamlessly handled by spark for you. In other terms, as stated in the Auto scaling scikit-learn with spark article:
no change is required in the code between the single-machine case and the cluster case.

General principles behind Spark MLlib parallelism

I'm new to Spark (and to cluster computing framework) and I'm wondering about the general principles followed by the parallel algorithms used for machine learning (MLlib). Are they essentially faster because Spark distributes training data over multiple nodes? If yes, I suppose that all nodes share the same set of parameters right? And that they have to combine (ex: summing) the intermediate calculations (ex: the gradients) on a regular basis, am I wrong?
Secondly, suppose I want to fit my data with an ensemble of models (ex: 10). Wouldn't it be simpler in this particular context to run my good old machine-learning program independently on 10 machines instead of having to write complicated code (for me at least!) for training in a Spark cluster?
Corollary question: is Spark (or other cluster computing framework) useful only for big data applications for which we could not afford training more than one model and for which training time would be too much long on a single machine?
You correct about the general principle. Typical MLlib algorithm is a an iterative procedure with local phase and data exchange.
MLlib algorithms are not necessarily faster. They try to solve two problems:
disk latency.
memory limitations on a single machine.
If you can process data on a single node this can be orders of magnitude faster than using ML / MLlib.
The last question is hard to answer but:
It is not complicated to train ensembles:
def train_model(iter):
items = np.array(list(iter))
model = ...
return model
rdd.mapPartitions(train_model)
There are projects which already do that (https://github.com/databricks/spark-sklearn)

Spark Streaming - Can an offline model be used against a data stream

In this link - LINK, it is mentioned that a machine learning model which has been constructed offline can be used against streaming data for testing.
Excerpt from the Apache Spark Streaming MLlib link:
" You can also easily use machine learning algorithms provided by MLlib. First of all, there are streaming machine learning algorithms (e.g. Streaming Linear Regression, Streaming KMeans, etc.) which can simultaneously learn from the streaming data as well as apply the model on the streaming data. Beyond these, for a much larger class of machine learning algorithms, you can learn a learning model offline (i.e. using historical data) and then apply the model online on streaming data. See the MLlib guide for more details.
"
Does this mean that one can use a complex learning model like Random Forest model built in Spark for testing against streaming data in Spark Streaming program? Is it as simple as referring to the "Model" which has been built and calling "predictOnValues()" over it in Spark Streaming program?
In this case, would the main difference between the existing spark streaming machine learning algorithms (AND) this approach be the fact that the streaming algorithms will evolve over time and the offline(against)online stream approach would still be using the insights from what it had learnt earlier without any possibility of online learning?
Am I getting this right? Please let me know if my understanding for both the points mentioned above is correct.
Does this mean that one can use a complex learning model like Random Forest model built in Spark for testing against streaming data in Spark Streaming program?
Yes, you can train a model like Random Forest in batch mode and store the model for predictions later. In case you want to integrate this with a streaming application where values are coming continuously for prediction you just need to load the model(which actually reads the feature vector and its weight) in memory and do prediction till the end.
Is it as simple as referring to the "Model" which has been built and calling "predictOnValues()" over it in Spark Streaming program?
Yes.
In this case, would the main difference between the existing spark streaming machine learning algorithms (AND) this approach be the fact that the streaming algorithms will evolve over time and the offline(against)online stream approach would still be using the insights from what it had learnt earlier without any possibility of online learning?
Training a model does nothing more than updating weight vector for features. You still have to choose alpha(learning rate) and lambda(regularisation parameter). So, when you will be using StreamingLinearRegression (or other streaming equivalents) you will have two dStreams one for training and other for prediction for obvious purposes.

Resources