Can Spark and the ScalaNLP library Breeze be used together? - apache-spark

I'm developing a Scala-based extreme learning machine in Apache Spark. My model has to be a Spark Estimator and use the Spark framework in order to fit into the machine learning pipeline. Does anyone know if Breeze can be used in tandem with Spark? All of my data is in Spark DataFrames, and conceivably I could import it using Breeze, use Breeze DenseVectors as the data structure, then convert to a DataFrame for the Estimator part. The advantage of Breeze is that it has a function, pinv, for the Moore-Penrose pseudo-inverse, which is an inverse for a non-square matrix. There is no equivalent function in Spark MLlib, as far as I can see. I have no idea whether it's possible to convert Breeze tensors to Spark DataFrames, so if anyone has experience of this it would be really useful. Thanks!

Breeze can be used with Spark. In fact, it is used internally by many MLlib functions, but the required conversions are not exposed as public API. You can add your own conversions and use Breeze to process individual records.
For example, for Vectors you can find the conversion code in:
SparseVector.asBreeze
DenseVector.asBreeze
Vectors.fromBreeze
For Matrices, see asBreeze / fromBreeze in Matrices.scala.
It cannot, however, be used on distributed data structures. Breeze objects use low-level libraries, which cannot be used for distributed processing. Therefore, conversions between DataFrames and Breeze objects are possible only if you collect the data to the driver, and they are limited to scenarios where the data fits in driver memory.
There are other libraries, like SystemML, which integrate with Spark and provide more comprehensive linear algebra routines on distributed objects.
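The driver-memory limit applies to the raw data, but for a tall matrix the pseudo-inverse solution can often be built from small per-partition summaries. The following is a minimal sketch in plain Python (not Spark or Breeze code), assuming the hidden-layer matrix H has full column rank so that pinv(H) = (HᵀH)⁻¹Hᵀ. Each partition accumulates its share of HᵀH and Hᵀt, the partial results are summed, and only a k x k system is solved locally. The names partition_stats, merge_stats and solve are hypothetical helpers, and the nested lists stand in for distributed partitions.

```python
from functools import reduce


def partition_stats(rows):
    """Local step: accumulate H^T H and H^T t over one partition of (h, t) rows."""
    k = len(rows[0][0])
    gram = [[0.0] * k for _ in range(k)]
    moment = [0.0] * k
    for h, t in rows:
        for i in range(k):
            moment[i] += h[i] * t
            for j in range(k):
                gram[i][j] += h[i] * h[j]
    return gram, moment


def merge_stats(a, b):
    """Combine step: element-wise sum of two partial (gram, moment) results."""
    gram = [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a[0], b[0])]
    moment = [x + y for x, y in zip(a[1], b[1])]
    return gram, moment


def solve(a, b):
    """Solve the small k x k system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("H^T H is singular; the full-rank assumption fails")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


# Two toy "partitions" of (hidden-layer row, target) pairs with t = 1 + 2 * x.
partitions = [
    [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0)],
    [([1.0, 2.0], 5.0), ([1.0, 3.0], 7.0)],
]
gram, moment = reduce(merge_stats, map(partition_stats, partitions))
beta = solve(gram, moment)  # output weights, equal to pinv(H) applied to t
```

In a real job the local and combine steps would run on the executors, and the final small solve is where a local library such as Breeze could be used on the driver. If H is rank-deficient this shortcut does not apply, and a true pseudo-inverse is needed.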

Related

Is it possible to load data on Spark workers directly into Apache Arrow in-memory format without first loading it into Spark's in-memory format?

We have a use case for doing a large number of vector multiplications and summing the results such that the input data typically will not fit into the RAM of a single host, even if using 0.5 TB RAM EC2 instances (fitting OLS regression models). Therefore we would like to:
Leverage PySpark for Spark's traditional capabilities (distributing the data, handling worker failures transparently, etc.)
But also leverage C/C++-based numerical computing for doing the actual math on the workers
The leading path seems to be to leverage Apache Arrow with PySpark, and use Pandas functions backed by NumPy (in turn written in C) for the vector products. However, I would like to load the data directly to Arrow format on Spark workers. The existing PySpark/Pandas/Arrow documentation seems to imply that the data is in fact loaded into Spark's internal representation first, then converted into Arrow when Pandas UDFs are called: https://spark.apache.org/docs/3.0.1/sql-pyspark-pandas-with-arrow.html#setting-arrow-batch-size
I found one related paper to this in which the authors developed a zero-copy Arrow-based interface for Spark, so I take it that this is a highly custom thing to do not currently supported in Spark: https://users.soe.ucsc.edu/~carlosm/dev/publication/rodriguez-arxiv-21/rodriguez-arxiv-21.pdf
I would like to ask if anyone knows of a simple way other than what is described in this paper. Thank you!

Explain the connection between Spark libraries, such as Spark SQL, MLlib, GraphX and Spark Streaming

Explain the connection between libraries, such as Spark SQL, MLlib, GraphX and Spark Streaming, and the core Spark platform.
Basically, Spark is the base: an engine that allows large-scale data processing with high performance. It provides an interface for programming with implicit data parallelism and fault tolerance.
GraphX, MLlib, Spark Streaming and Spark SQL are modules built on top of this engine, each with a different goal. Each of these libraries adds new objects and functions that provide support for certain types of structures or features.
For example:
GraphX is a distributed graph processing module which allows representing a graph and applying efficient transformations, partitioning strategies and algorithms specialized for this kind of structure.
MLlib is a distributed machine learning module on top of Spark which implements algorithms for tasks like classification, regression and clustering.
Spark SQL introduces the notion of DataFrames, the most important structure in this module, which allows applying SQL operations (e.g. select, where, groupBy, ...).
Spark Streaming is an extension of core Spark which ingests data in mini-batches and performs transformations on those mini-batches of data. Spark Streaming has built-in support for consuming from Kafka, Flume, and other platforms.
You can combine these modules according to your needs. For example, if you want to process a large graph in order to apply a clustering algorithm, you can use the representation provided by GraphX and use MLlib to apply K-means on this representation.
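To make the mini-batch idea behind Spark Streaming concrete, here is a small sketch in plain Python, not the Spark API. It cuts an incoming stream into batches and applies the same transformation (a word count) to each batch. Spark Streaming forms batches by time interval; this sketch assumes a fixed number of records per batch so that the result is deterministic. The names micro_batches and count_words are hypothetical.

```python
from collections import Counter
from itertools import islice


def micro_batches(stream, batch_size):
    """Yield consecutive lists of up to batch_size records from the stream."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def count_words(batch):
    """Transformation applied independently to every mini-batch."""
    return Counter(word for line in batch for word in line.split())


# A toy stream of five text records, processed two records at a time.
events = ["spark streaming", "spark sql", "graphx", "spark mllib", "mllib"]
per_batch = [count_words(batch) for batch in micro_batches(events, 2)]
```

Each element of per_batch holds the counts for one mini-batch only; keeping a running total across batches would be a separate, stateful step.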

PySpark with scikit-learn

I have seen that we can use scikit-learn libraries with PySpark to work on a partition on a single worker.
But what if we want to work on a training dataset that is distributed, and the regression algorithm should take the entire dataset into account? Since scikit-learn is not integrated with RDDs, I assume it doesn't allow running the algorithm on the entire dataset but only on that particular partition. Please correct me if I'm wrong.
And how good is spark-sklearn at solving this problem?
As described in the documentation, spark-sklearn does answer your requirements:
train and evaluate multiple scikit-learn models in parallel. It is a distributed analog to the multicore implementation included by default in scikit-learn.
convert Spark's Dataframes seamlessly into numpy ndarrays or sparse matrices.
So, to specifically answer your questions:
But what if we want to work on training dataset that is distributed and say the regression algorithm should concern with entire dataset. Since scikit learn is not integrated with RDD I assume it doesn't allow to run the algorithm on the entire dataset on that particular partition
In spark-sklearn, Spark is used as a replacement for the joblib library as the parallel execution framework. So, going from execution on a single machine to execution on multiple machines is handled seamlessly by Spark for you. In other words, as stated in the Auto scaling scikit-learn with spark article:
no change is required in the code between the single-machine case and the cluster case.
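The pattern in the quoted description, training several candidate models in parallel, can be sketched in plain Python without Spark or scikit-learn. In this hypothetical example a thread pool stands in for joblib or the Spark executors, and fit_and_score is an invented helper that fits a one-dimensional ridge regression in closed form. As I read the quoted description, the unit of parallelism is the candidate model rather than the data partition, so every task here receives the full dataset.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy dataset with y roughly equal to 2 * x; every task sees all of it.
X = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]


def fit_and_score(alpha):
    """Fit y = w * x with ridge penalty alpha; return (mse, alpha, w)."""
    w = sum(a * b for a, b in zip(X, y)) / (sum(a * a for a in X) + alpha)
    mse = sum((w * a - b) ** 2 for a, b in zip(X, y)) / len(X)
    return mse, alpha, w


# One task per hyperparameter value; the pool plays the role of the cluster.
alphas = [0.0, 1.0, 10.0, 100.0]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fit_and_score, alphas))

# Pick the candidate with the lowest error.
best_mse, best_alpha, best_w = min(results)
```

For brevity the sketch scores each candidate on the training data, whereas a real grid search would use cross-validation. If that reading of the parallelism is right, a single fit still needs its training data to fit on one worker.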

Scalable invocation of Spark MLlib 1.6 predictive model w/a single data record

I have a predictive model (Logistic Regression) built in Spark 1.6 that has been saved to disk for later reuse with new data records. I want to invoke it from multiple clients, with each client passing in a single data record. It seems that using a Spark job to run single records through would have way too much overhead and would not be very scalable (each invocation will only pass in a single set of 18 values). The MLlib API to load a saved model requires the SparkContext, though, so I am looking for suggestions on how to do this in a scalable way. Spark Streaming with Kafka input comes to mind (each client request would be written to a Kafka topic). Any thoughts on this idea, or alternative suggestions?
Non-distributed models from o.a.s.mllib (in practice, the majority) don't require an active SparkContext for single-item predictions. If you check the API docs, you'll see that LogisticRegressionModel provides a predict method with the signature Vector => Double. This means you can serialize the model using standard Java tools, read it back later, and perform predictions on a local o.a.s.mllib.Vector object.
Spark also provides limited PMML support (not for logistic regression), so you can share your models with any other library that supports this format.
Finally, non-distributed models are usually not that complex. For linear models, all you need is the intercept, the coefficients, some basic math functions, and a linear algebra library (if you want decent performance).
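As a rough illustration of how little is needed for a binary logistic regression, here is a minimal sketch in plain Python with no Spark dependency. The coefficients and intercept below are made-up values, not taken from any real model; in practice you would export them from the trained model. The function names and the default threshold of 0.5 are assumptions made for this example.

```python
import math


def predict_probability(features, coefficients, intercept):
    """Logistic function applied to the linear margin w . x + b."""
    margin = intercept + sum(w * x for w, x in zip(coefficients, features))
    # Numerically stable form that avoids overflow for large |margin|.
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    e = math.exp(margin)
    return e / (1.0 + e)


def predict(features, coefficients, intercept, threshold=0.5):
    """Return the class label 1.0 or 0.0 for a single record."""
    p = predict_probability(features, coefficients, intercept)
    return 1.0 if p > threshold else 0.0


# Hypothetical parameters for a model with two features.
coefficients = [0.5, -0.25]
intercept = 0.1

label_a = predict([2.0, 4.0], coefficients, intercept)  # margin = 0.1
label_b = predict([0.0, 4.0], coefficients, intercept)  # margin = -0.9
```

A function like this can sit behind any ordinary request handler, so serving single records does not need a Spark job at all.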
o.a.s.ml models are slightly harder to handle, but there are some external tools which try to address that. For details, you can check the related discussion on the developers list (Deploying ML Pipeline Model).
For distributed models there is really no good workaround. You'll have to start a full job on distributed dataset one way or another.

Optimization Routine for Logistic Regression in ML (Spark 1.6.2)

Dear Apache Spark Community:
I've been reading Spark's documentation for several weeks. I read about Logistic Regression in MLlib and realized that Spark uses two kinds of optimization routines (SGD and L-BFGS).
Currently, however, I'm reading the documentation for Logistic Regression in ML, and I couldn't find an explicit statement of which optimization routine the developers used. How can I find this information?
With many thanks.
The key point is the API that each library uses.
MLlib is focused on the RDD API, the core of Spark, but some operations such as sums, averages and other simple functions take more time than the equivalent DataFrame operations.
ML is a library that works with DataFrames. The DataFrame has query optimization for basic functions like sums and similar aggregations.
You can check this blog post; this is one of the reasons that ML should be faster than MLlib.
