How to use MLlib in Spark SQL - apache-spark

Lately, I've been learning about Spark SQL, and I want to know: is there any way to use MLlib in Spark SQL, like:
select mllib_methodname(some column) from tablename;
Here, "mllib_methodname" is an MLlib method.
Is there an example that shows how to use MLlib methods in Spark SQL?
Thanks in advance.

The new pipeline API is based on DataFrames, which are backed by Spark SQL. See
http://spark.apache.org/docs/latest/ml-guide.html
Alternatively, you can register the predict method of an MLlib model as a UDF and use it in your SQL statements. See
http://spark.apache.org/docs/latest/sql-programming-guide.html#udf-registration-moved-to-sqlcontextudf-java--scala
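As a rough illustration of the UDF approach (not from the original answer): the sketch below assumes a pre-trained pyspark.mllib LogisticRegressionModel saved at "path/to/model" and a registered table tablename with a numeric column some_column; all of those names are placeholders.

from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType
from pyspark.mllib.classification import LogisticRegressionModel

spark = SparkSession.builder.getOrCreate()

# Load a previously trained MLlib model (placeholder path).
model = LogisticRegressionModel.load(spark.sparkContext, "path/to/model")

# Wrap the model's predict method in a UDF so it can be called from SQL.
spark.udf.register("mllib_predict",
                   lambda x: float(model.predict([x])),
                   DoubleType())

spark.sql("SELECT mllib_predict(some_column) FROM tablename").show()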

Related

Spark Datasets available in Python?

Here, it is stated:
"..you can create Datasets within a Scala or Python.."
while here, the following is stated:
"Python does not have the support for the Dataset API"
Are Datasets available in Python?
Perhaps the question is about typed Spark Datasets.
If so, then the answer is no.
The typed Datasets mentioned above are only available in Scala and Java.
In the Python implementation of Spark (PySpark) you have to choose between DataFrames, as the preferred option, and RDDs.
Reference:
RDD vs. DataFrame vs. Dataset
Update 2022-09-26: Clarification regarding typed Spark Datasets

How to use ONNX models for inference in Spark

I have trained a model for text classification using huggingface/transformers, then exported it using the built-in ONNX functionality.
Now I'd like to use it for inference on millions of texts (around 100 million sentences). My idea is to put all the texts in a Spark DataFrame, wrap the .onnx model in a Spark UDF, and run inference that way on a Spark cluster.
Is there a better way of doing this? Am I doing things "the right way"?
I am not sure if you are aware of and/or allowed to use SynapseML, given its requirements ("SynapseML requires Scala 2.12, Spark 3.0+, and Python 3.6+", per the landing page as of today), but SynapseML does have support for ONNX inference on Spark. This could probably be the cleanest solution for you.
EDIT: MLflow also has support for exporting a python_function model as an Apache Spark UDF. With MLflow, you save your model in, say, the ONNX format, log/register it via mlflow.onnx.log_model, and later retrieve it in the mlflow.pyfunc.spark_udf call via its model URI, e.g. models:/<model-name>/<model-version>.
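A rough sketch of the MLflow route (assuming the ONNX model has already been logged/registered in the MLflow Model Registry under the illustrative name "text-classifier", that the workers have onnxruntime installed, and that the logged model accepts the raw text column directly, e.g. because tokenization was wrapped into it):

import mlflow
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Wrap the registered model as a Spark UDF.
predict = mlflow.pyfunc.spark_udf(
    spark,
    model_uri="models:/text-classifier/1",
    result_type="double",
)

# Score a DataFrame of texts; the file and column names are placeholders.
df = spark.read.parquet("texts.parquet")
scored = df.withColumn("prediction", predict("text"))
scored.write.parquet("predictions.parquet")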

Can Spark and the ScalaNLP library Breeze be used together?

I'm developing a Scala-based extreme learning machine in Apache Spark. My model has to be a Spark Estimator and use the Spark framework in order to fit into the machine learning pipeline. Does anyone know if Breeze can be used in tandem with Spark? All of my data is in Spark DataFrames; conceivably I could import it using Breeze, use Breeze DenseVectors as the data structure, then convert to a DataFrame for the Estimator part. The advantage of Breeze is that it has a pinv function for the Moore-Penrose pseudo-inverse, which is an inverse for non-square matrices. There is no equivalent function in Spark MLlib, as far as I can see. I have no idea whether it's possible to convert Breeze tensors to Spark DataFrames, so if anyone has experience with this it would be really useful. Thanks!
Breeze can be used with Spark. In fact, it is used internally by many MLlib functions, but the required conversions are not exposed as public API. You can add your own conversions and use Breeze to process individual records.
For example, for Vectors you can find the conversion code in:
SparseVector.asBreeze
DenseVector.asBreeze
Vector.fromBreeze
For Matrices please see asBreeze / fromBreeze in Matrices.scala
It cannot, however, be used on distributed data structures. Breeze objects use low-level libraries that cannot be used for distributed processing. Therefore, converting between DataFrames and Breeze objects is possible only if you collect the data to the driver, and it is limited to scenarios where the data fits in driver memory.
There are other libraries, like SystemML, which integrate with Spark and provide more comprehensive linear algebra routines on distributed objects.

Optimization Routine for Logistic Regression in ML (Spark 1.6.2)

Dear Apache Spark Community:
I've been reading Spark's documentation for several weeks. I read about Logistic Regression in MLlib and realized that Spark uses two kinds of optimization routines (SGD and L-BFGS).
However, I'm currently reading the documentation of LogisticRegression in ML, and I couldn't see explicitly what kind of optimization routine the developers used. How can I find this information?
With many thanks.
The key point is the API each library is built on.
MLlib is focused on the RDD API, the core of Spark, but some operations such as sums, averages, and other simple functions take more time there than they do with DataFrames.
ML is the library that works with DataFrames, and DataFrames benefit from query optimization for basic operations like sums and similar aggregations.
You can check this blog post; this is one of the reasons why ML should be faster than MLlib.
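To make the API difference concrete, here is a small illustrative sketch (the toy data is made up): the ML LogisticRegression is fit on a DataFrame, while the MLlib LogisticRegressionWithLBFGS is trained on an RDD of LabeledPoint.

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import LogisticRegression              # DataFrame-based "ML" API
from pyspark.mllib.classification import LogisticRegressionWithLBFGS  # RDD-based "MLlib" API
from pyspark.mllib.regression import LabeledPoint

spark = SparkSession.builder.getOrCreate()

# ML: expects a DataFrame with "label" and "features" columns.
df = spark.createDataFrame(
    [(0.0, Vectors.dense(0.0, 1.1)), (1.0, Vectors.dense(2.0, 1.0))],
    ["label", "features"])
ml_model = LogisticRegression(maxIter=10).fit(df)

# MLlib: expects an RDD of LabeledPoint.
rdd = spark.sparkContext.parallelize(
    [LabeledPoint(0.0, [0.0, 1.1]), LabeledPoint(1.0, [2.0, 1.0])])
mllib_model = LogisticRegressionWithLBFGS.train(rdd, iterations=10)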

pySpark: Save ML Model

Can someone please give an example of how you would save an ML model in pySpark?
For
ml.classification.LogisticRegressionModel
I try to use the following:
model.save("path")
but it does not seem to work.
If I understand your question correctly, your method signature is incorrect.
According to the docs, you also need to pass in your SparkContext.
Docs: https://spark.apache.org/docs/1.6.1/api/python/pyspark.mllib.html?highlight=save#pyspark.mllib.classification.LogisticRegressionModel.save
In Spark 2.3.0, if you are using ML:
model.save("path")
Refer: Spark ML model .save
(I just ran LogisticRegression and saved it.)
But if you are using MLlib, then as the other answer suggests, use:
save(sc, path)
Refer: Spark MLlib model .save
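As a minimal sketch of both variants (toy data and /tmp paths are placeholders): the DataFrame-based ML model is saved with just a path, while an RDD-based MLlib model also needs the SparkContext.

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import LogisticRegression, LogisticRegressionModel

spark = SparkSession.builder.getOrCreate()

# Fit a small DataFrame-based (ML) model on toy data.
training = spark.createDataFrame(
    [(0.0, Vectors.dense(0.0, 1.1)), (1.0, Vectors.dense(2.0, 1.0))],
    ["label", "features"])
model = LogisticRegression(maxIter=10).fit(training)

# ML API (Spark 2.x+): save and load take only a path.
model.save("/tmp/ml-logreg-model")
reloaded = LogisticRegressionModel.load("/tmp/ml-logreg-model")

# An RDD-based MLlib model would additionally need the SparkContext, e.g.:
#   mllib_model.save(spark.sparkContext, "/tmp/mllib-logreg-model")
#   pyspark.mllib.classification.LogisticRegressionModel.load(spark.sparkContext, "/tmp/mllib-logreg-model")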
