Parallelization of sklearn functions using MPI without cross-validation - scikit-learn

I have a group of time series to which I want to apply a LASSO regression using sklearn. As the data is pretty sparse, I need the whole length of each time series, so I can't cross-validate. The datasets are large and training is time consuming, so I have to run it on a cluster.
In order to use different nodes I use MPI. As far as I know, it is possible to run sklearn functions on a cluster using MPI, but this approach basically parallelizes over cross-validation chunks, as in the following project:
https://github.com/sebp/scikit-learn-mpi-grid-search
I was wondering if there is any other way to use MPI to parallelize the training process in sklearn without cross-validation. I think that would require the underlying algorithm of the sklearn function itself to be parallelized.
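One workaround, if each time series gets its own LASSO fit, is to keep every fit serial but scatter the independent fits across MPI ranks with mpi4py. A minimal sketch, assuming a hypothetical load_series() helper and an arbitrary alpha:

from mpi4py import MPI
from sklearn.linear_model import Lasso

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

if rank == 0:
    tasks = load_series()                    # hypothetical: list of (X, y) pairs, one per series
    chunks = [tasks[i::size] for i in range(size)]
else:
    chunks = None

my_tasks = comm.scatter(chunks, root=0)      # each rank receives its share of series
my_models = [Lasso(alpha=0.1).fit(X, y) for X, y in my_tasks]
all_models = comm.gather(my_models, root=0)  # fitted models collected on rank 0

This parallelizes across series, not within a single Lasso fit; parallelizing the coordinate-descent solver itself would require changes inside scikit-learn.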

Related

Sklearn neural network with maximum number of cores available?

I want to use MLPRegressor from sklearn with all 12 cores available to me; however, I do not see any option to select the number of cores (unlike RandomForestClassifier, which has the n_jobs option).
Is there another way to make sure it uses all 12 cores? I have vaguely heard about joblib, but how would I use it correctly?
MLPRegressor does not contain any multithreading per se, though the matrix operations will be vectorized and parallelized via numpy.
You may be able to get better performance by varying your batch size, but if performance is critical you should use a deep learning library like TensorFlow.
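As a minimal sketch of those two knobs, assuming a multithreaded BLAS (OpenBLAS/MKL) behind numpy: threadpoolctl bounds the number of BLAS threads used during fit, and batch_size is a regular MLPRegressor parameter; the sizes below are arbitrary placeholders.

from threadpoolctl import threadpool_limits
from sklearn.neural_network import MLPRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=5000, n_features=50, random_state=0)

model = MLPRegressor(hidden_layer_sizes=(100, 100), batch_size=512, max_iter=50)
with threadpool_limits(limits=12):  # cap the BLAS thread pool at 12 threads
    model.fit(X, y)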

Parallelizing GPflow 2.0 GP regression for large datasets

I am trying to run a GP regression over 2D space + 1D time with ~8000 observations and a composite kernel with 4 Matern 3/2 covariance functions -- more than a single core can handle.
It would be great to be able to distribute the GPR computation over multiple nodes rather than having to resort to variational GP. This GitHub issue explains how to enable multithreading in GPflow 1.0, but I am not looking for a way to parallelize many predict_f calls.
Rather, I want to do GPR on a large dataset, which means inverting a covariance matrix larger than a single core can handle. Is there a way to parallelize this computation for a cluster or the Cloud?
In terms of computation, GPflow can do whatever TensorFlow does: if TensorFlow supported distributed cloud execution, GPflow would support it as well. That doesn't mean you cannot implement your own, possibly more efficient, version of the computation and run it in the cloud; you can start by looking into TensorFlow custom ops: https://www.tensorflow.org/guide/create_op.
Linear-algebra operations such as the Cholesky decomposition are hard to parallelise across machines, and the time savings would be questionable, although memory-wise the advantage of cluster computing is obvious.
If you're interested in MVM-based inference we have a bit of a start here:
https://github.com/tensorflow/probability/blob/7c70d4a3389680670e989b93561440caaa0fb8cd/tensorflow_probability/python/experimental/linalg/linear_operator_psd_kernel.py#L252
I've been playing with stochastic Lanczos quadrature for the log-determinant, and preconditioned CG for the solve, but so far have not committed those into TFP.
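For reference, a single-node baseline of the model described in the question, assuming GPflow 2's GPR and kernel API; the data arrays below are random placeholders.

import numpy as np
import gpflow

# placeholder data: 2D space + 1D time, ~8000 observations
X = np.random.rand(8000, 3)
Y = np.random.rand(8000, 1)

# composite kernel: four Matern 3/2 terms over space and time
kernel = (gpflow.kernels.Matern32(active_dims=[0, 1])
          + gpflow.kernels.Matern32(active_dims=[2])
          + gpflow.kernels.Matern32(active_dims=[0, 1])
          + gpflow.kernels.Matern32(active_dims=[2]))

model = gpflow.models.GPR(data=(X, Y), kernel=kernel)

# each loss evaluation performs an 8000 x 8000 Cholesky -- the step that is
# hard to spread over nodes, as discussed above
loss = model.training_loss()

# single-process optimisation (slow at this size):
# gpflow.optimizers.Scipy().minimize(model.training_loss, model.trainable_variables)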

Keras use multi-gpu without Model object (not for training)

I have a bunch of tensor operations (matmul, transpose, etc.) that I would like to run on a large dataset.
Since they are still matrix operations, and since I am using Keras generators to load the data batches, it would make sense to use GPUs to compute them.
Now, I've searched for a while and I can't seem to find the correct way to use Keras for parallel GPU operations, using generators, outside of the standard Model object interface.
Does anyone know how to do it? Thanks!
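One possible approach, not taken from this thread: wrap the raw tensor ops in a tf.function, feed them through a tf.data pipeline built from the generator, and let tf.distribute.MirroredStrategy split each batch across the visible GPUs, with no Keras Model involved. The generator and the op body below are placeholders.

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # uses all visible GPUs

def generator():
    for _ in range(100):                     # placeholder data source
        yield tf.random.normal((256, 512))

dataset = tf.data.Dataset.from_generator(
    generator,
    output_signature=tf.TensorSpec(shape=(256, 512), dtype=tf.float32))
dist_dataset = strategy.experimental_distribute_dataset(dataset)

@tf.function
def step(x):
    # placeholder tensor ops: matmul against the transpose
    return tf.matmul(x, x, transpose_b=True)

for batch in dist_dataset:
    result = strategy.run(step, args=(batch,))  # runs on each replica's slice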

PySpark with scikit-learn

I have seen that we can use scikit-learn with PySpark to work on a single partition on a single worker.
But what if we want to train on a dataset that is distributed, where the regression algorithm needs to consider the entire dataset? Since scikit-learn is not integrated with RDDs, I assume it cannot run the algorithm on the entire dataset but only on a particular partition. Please correct me if I'm wrong.
And how good is spark-sklearn at solving this problem?
As described in the documentation, spark-sklearn does answer your requirements: it lets you train and evaluate multiple scikit-learn models in parallel (a distributed analog to the multicore implementation included by default in scikit-learn) and convert Spark's DataFrames seamlessly into numpy ndarrays or sparse matrices.
So, to specifically answer your question:
But what if we want to train on a dataset that is distributed, where the regression algorithm needs to consider the entire dataset? Since scikit-learn is not integrated with RDDs, I assume it cannot run the algorithm on the entire dataset but only on a particular partition.
In spark-sklearn, Spark replaces the joblib library as the parallelization backend. So going from execution on a single machine to execution on multiple machines is handled seamlessly by Spark for you. In other words, as stated in the Auto-scaling scikit-learn with Spark article:
no change is required in the code between the single-machine case and the cluster case.
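For concreteness, a minimal sketch along the lines of the spark-sklearn README, assuming its drop-in GridSearchCV that takes a SparkContext as its first argument:

from sklearn import svm, datasets
from spark_sklearn import GridSearchCV
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
iris = datasets.load_iris()

param_grid = {"kernel": ("linear", "rbf"), "C": [1, 10]}
clf = GridSearchCV(sc, svm.SVC(gamma="auto"), param_grid)
clf.fit(iris.data, iris.target)  # each candidate fit runs as a Spark task

Note that this distributes many independent model fits; each single fit still sees the whole driver-local dataset, consistent with the quoted documentation.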

how to do multiple target linear regression in Spark MLLib?

Spark ML LinearRegression seems to regress against a single label.
LabeledPoint(label: Double, features: Array[Double])
https://spark.apache.org/docs/0.8.1/api/mllib/org/apache/spark/mllib/regression/LabeledPoint.html
However, with my problem, I need to predict a vector
e.g.
LabeledPoint(label: Array[Double], features: Array[Double])
Is there a way for me to do this? (This is supported in scikit-learn and I am trying to do it in Spark.)
PS 1: If this is not possible in MLlib directly, is there a tutorial on how to implement this from scratch using Spark?
PS 2: My output label is a 60-element vector, so I could run LinearRegression 60 times and then run 60 predictions, but that seems like a hack.
There is no native implementation as far as I know, but if you look at the scikit-learn implementation of multioutput regression, it says that the "strategy consists of fitting one regressor per target. Since each target is represented by exactly one regressor it is possible to gain knowledge about the target by inspecting its corresponding regressor".
This means that a potential implementation could be to fit one regressor per target and parallelize that step, distributing the independent fits across the cluster to speed things up.
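A rough sketch of that one-regressor-per-target strategy with Spark ML's own LinearRegression; the DataFrames and column names (features, y_0 ... y_59) are hypothetical placeholders.

from pyspark.ml.regression import LinearRegression

n_targets = 60
models = []
for i in range(n_targets):
    # train_df: hypothetical DataFrame with a vector 'features' column and label columns y_0 .. y_59
    lr = LinearRegression(featuresCol="features", labelCol=f"y_{i}")
    models.append(lr.fit(train_df))

# one prediction column per target; stitch them back together as needed
predictions = [m.transform(test_df).select("prediction") for m in models]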
