Does spark support matrices? - apache-spark

Most algorithms that use matrix operations in Spark have to use either Vectors or store their data in a different way. Is there support for building matrices directly in Spark?

Apache recently released Spark 1.0, which adds support for creating matrices in Spark. It is still in an experimental phase and only a limited set of operations can be performed on the matrices you create, but this is sure to grow in future releases. The idea of matrix operations being performed at Spark's speed is very appealing.
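For illustration, here is a minimal sketch of the MLlib linear-algebra types in PySpark (which classes are exposed in Python depends on your Spark version, and sc is assumed to be an existing SparkContext):
from pyspark.mllib.linalg import Matrices, Vectors
from pyspark.mllib.linalg.distributed import RowMatrix

# Local 3x2 dense matrix; values are given in column-major order
local_m = Matrices.dense(3, 2, [1, 2, 3, 4, 5, 6])

# Distributed matrix backed by an RDD of row vectors
rows = sc.parallelize([Vectors.dense(1.0, 2.0), Vectors.dense(3.0, 4.0)])
dist_m = RowMatrix(rows)
print(dist_m.numRows(), dist_m.numCols())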

The way I use matrices in Spark is through Python with NumPy/SciPy. I pull the data into matrices from a CSV file and use them as needed, treating them the same as I would in plain Python/SciPy. What makes it slightly different is how you parallelize the data.
Something like this:
from pyspark.mllib.regression import LabeledPoint

data = []
for i in range(na + 2):  # na, b, A and wa come from the earlier NumPy/SciPy preprocessing
    data.append(LabeledPoint(b[i], A[i, :]))
model = WhatYouDo.train(sc.parallelize(data), iterations=40, step=0.01, initialWeights=wa)  # WhatYouDo = whichever MLlib algorithm you train
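For context, b and A above come from an ordinary NumPy CSV load; a rough sketch (the file name, column layout, and the meaning of na are my assumptions, not from the original answer):
import numpy as np

raw = np.loadtxt("data.csv", delimiter=",")  # hypothetical CSV: first column is the target, the rest are features
b = raw[:, 0]
A = raw[:, 1:]
na = A.shape[0] - 2          # so that range(na + 2) in the loop above covers every row
wa = np.zeros(A.shape[1])    # initial weights for the MLlib trainer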
The pain was getting NumPy/SciPy onto the Spark nodes. The best way I found to make sure all the other libraries and files needed were included was to use:
sudo yum install numpy scipy python-matplotlib ipython python-pandas sympy python-nose

Related

Parallelizing GPflow 2.0 GP regression for large datasets

I am trying to run a GP regression over 2D space + 1D time with ~8000 observations and a composite kernel with 4 Matern 3/2 covariance functions -- more than a single core can handle.
It would be great to be able to distribute the GPR computation over multiple nodes rather than having to resort to variational GP. This github issue explains how to execute multithreading in GPflow 1.0, but I am not looking for a way to parallelize many predict_f calls.
Rather, I want to do GPR on a large dataset, which means inverting a covariance matrix larger than a single core can handle. Is there a way to parallelize this computation for a cluster or the Cloud?
In terms of computation, GPflow can do whatever TensorFlow does; in other words, if TensorFlow supported cloud evaluation, GPflow would support it as well. That said, nothing stops you from implementing your own, possibly more efficient, version of the computation and running it on the cloud. You can start by looking into TensorFlow custom ops: https://www.tensorflow.org/guide/create_op.
The linalg operations, like the Cholesky decomposition, are hard to parallelise, and the time savings from doing so would be questionable, although memory-wise the advantage of cluster computing is obvious.
If you're interested in MVM-based inference we have a bit of a start here:
https://github.com/tensorflow/probability/blob/7c70d4a3389680670e989b93561440caaa0fb8cd/tensorflow_probability/python/experimental/linalg/linear_operator_psd_kernel.py#L252
I've been playing with stochastic Lanczos quadrature for the log-determinant, and preconditioned CG for the solve, but so far have not committed those to TFP.
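For reference, the exact-GPR setup the question describes might look roughly like this in GPflow 2.x (the data shapes and kernel composition below are placeholders); the cost that blows up is the Cholesky of the 8000 x 8000 covariance inside the marginal likelihood:
import numpy as np
import gpflow

# Placeholder data: columns of X are (x, y, t); Y is the observed response
X = np.random.rand(8000, 3)
Y = np.random.rand(8000, 1)

# Composite kernel: Matern 3/2 over space plus Matern 3/2 over time (the question uses four components)
kernel = gpflow.kernels.Matern32(active_dims=[0, 1]) + gpflow.kernels.Matern32(active_dims=[2])
model = gpflow.models.GPR(data=(X, Y), kernel=kernel)

# Each evaluation of the training loss performs an O(n^3) Cholesky on a single device
loss = model.training_loss()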

RobustScaler partial_fit() similar to MinMaxScaler or StandardScaler

I have been using RobustScaler to scale data, and recently we added additional data that is pushing the memory limits of fit_transform. I was hoping to do partial_fit on subsets of the data, but it looks like RobustScaler does not provide that functionality. Most of the other scalers (MinMax, Standard, Abs) seem to have partial_fit.
Since I have outliers in the data, I need to use RobustScaler. I tried the MinMax and Standard scalers, but the outliers influence the scaling too much.
I was hoping to find an alternative to doing fit_transform on the large dataset, similar to partial_fit in the other scalers.
If using scikit-learn is not a hard requirement for you, you could check out msmbuilder, a library for biomolecular dynamics.
According to its documentation, it provides a RobustScaler similar to scikit-learn's, with the option of using partial_fit.
Link: http://msmbuilder.org/3.7.0/_preprocessing/msmbuilder.preprocessing.RobustScaler.html#msmbuilder.preprocessing.RobustScaler
PS: I have not tested it.
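If the msmbuilder scaler really does mirror the scikit-learn interface, incremental fitting might look like the following untested sketch (the chunking and the partial_fit/transform calls are assumptions based on the linked documentation):
import numpy as np
from msmbuilder.preprocessing import RobustScaler  # untested: assumed to follow the scikit-learn API

scaler = RobustScaler()
for chunk in np.array_split(X, 10):  # X is the large feature matrix, processed in pieces
    scaler.partial_fit(chunk)
X_scaled = scaler.transform(X)  # note: msmbuilder transformers may expect a list of arrays instead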

PyMC3/Edward/Pyro on Spark?

Has anyone tried using a python probabilistic programming library with Spark? Or does anyone have a good idea of what it would take?
I have a feeling Edward would be simplest because there are already tools connecting TensorFlow and Spark, but I'm still hazy about what low-level code changes would be required.
I know distributed MCMC is still an area of active research (see MC-Stan on Spark?), so is this even reasonable to implement? Thanks!
Since Edward is based on TensorFlow, you can use the TensorFlow connectors with it. One of the main drawbacks of MCMC is that it is very computationally intensive; you could try variational inference for your Bayesian models instead, which approximates the target distribution (this also applies to Pyro and PyMC3, I believe). You can also work with distributed TensorFlow.
I also recommend trying a library called Dask (https://dask.pydata.org/en/latest/). With Dask you can scale your model from your workstation to a cluster, and it also has TensorFlow connectors.
Hope this helps
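As a concrete illustration of the Dask suggestion, here is a minimal sketch of fanning independent model fits out over a Dask cluster (fit_one_chain is a placeholder for whatever PyMC3/Pyro/Edward fitting you do; it is not part of any of these libraries):
import dask
from dask.distributed import Client

client = Client()  # local cluster; point this at a real scheduler address to scale out

@dask.delayed
def fit_one_chain(seed):
    # build and fit your probabilistic model here; return a small summary,
    # not the full trace, so results stay cheap to ship back to the client
    return {"seed": seed}

results = dask.compute(*[fit_one_chain(s) for s in range(8)])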
I've seen people run Pyro+PyTorch in PySpark, but the use case was CPU-only and did not involve distributed training.
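That CPU-only pattern is essentially per-partition inference; a hypothetical sketch of its shape (fit_on_partition is made up, and nothing here distributes a single model across executors):
def fit_on_partition(rows):
    import pyro  # imported inside the task so each executor loads its own copy
    rows = list(rows)
    # ...run SVI or MCMC on this partition's data only...
    yield {"n_rows": len(rows)}  # ship back a small summary per partition

summaries = (sc.parallelize(range(1000), 8)
               .mapPartitions(fit_on_partition)
               .collect())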

PySpark with scikit-learn

I have seen that we can use scikit-learn libraries with PySpark to work on a single partition on a single worker.
But what if we want to work on a training dataset that is distributed, and say the regression algorithm should concern itself with the entire dataset? Since scikit-learn is not integrated with RDDs, I assume it doesn't allow running the algorithm on the entire dataset but only on that particular partition. Please correct me if I'm wrong.
And how good is spark-sklearn at solving this problem?
As described in the documentation, spark-sklearn does answer your requirements:
train and evaluate multiple scikit-learn models in parallel. It is a distributed analog to the multicore implementation included by default in scikit-learn.
convert Spark's DataFrames seamlessly into numpy ndarrays or sparse matrices.
So, to specifically answer your question:
But what if we want to work on a training dataset that is distributed, and say the regression algorithm should concern itself with the entire dataset? Since scikit-learn is not integrated with RDDs, I assume it doesn't allow running the algorithm on the entire dataset but only on that particular partition.
In spark-sklearn, Spark is used as the replacement for the joblib library as a multithreading framework. So, going from execution on a single machine to execution on multiple machines is handled seamlessly by Spark for you. In other words, as stated in the Auto-scaling scikit-learn with Spark article:
no change is required in the code between the single-machine case and the cluster case.
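As a concrete example of that joblib-style replacement, here is a short sketch based on the spark-sklearn documentation (assuming spark_sklearn is installed, sc is your SparkContext, and X_train/y_train are in-memory arrays; the estimator and grid are arbitrary):
from sklearn.ensemble import RandomForestClassifier
from spark_sklearn import GridSearchCV  # drop-in analogue of scikit-learn's GridSearchCV

param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5, None]}
gs = GridSearchCV(sc, RandomForestClassifier(), param_grid)  # each grid point is fitted on a Spark executor
gs.fit(X_train, y_train)  # the training data itself must still fit on each worker
print(gs.best_params_)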

Are GraphFrames compatible with typed Dataset?

We currently use typed Datasets in our work, and we are exploring GraphFrames.
However, GraphFrames seems to be based on DataFrames, i.e. Dataset[Row]. Would GraphFrames be compatible with a typed Dataset, e.g. Dataset[Person]?
GraphFrames supports only DataFrames. To use a statically typed Dataset you have to convert it to a DataFrame, apply the graph operations, and convert back to the statically typed structure.
You can follow this issue: https://github.com/graphframes/graphframes/issues/133
