Custom Scoring Function in Spark Machine Learning - apache-spark

In sklearn there is the make_scorer function that allows you to define custom scoring functions to be passed into GridSearchCV. How can I do the same thing in Spark ML? For example, MulticlassClassificationEvaluator in Spark has no provision for passing in a custom scoring function.
https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/ml/evaluation/MulticlassClassificationEvaluator.html
Ideally, I would like the below interface. Has anyone done this before?
evaluator = MulticlassClassificationEvaluator(metricName=CustomScoringFunction)
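A minimal sketch of one possible approach is to subclass pyspark.ml.evaluation.Evaluator; the accuracy-style metric below is just a placeholder, and the usual prediction/label column names are assumed:
from pyspark.ml.evaluation import Evaluator
from pyspark.sql import functions as F

class CustomEvaluator(Evaluator):
    def __init__(self, predictionCol="prediction", labelCol="label"):
        super().__init__()
        self.predictionCol = predictionCol
        self.labelCol = labelCol

    def _evaluate(self, dataset):
        # Any scoring logic over the prediction DataFrame can go here;
        # plain accuracy is used as a placeholder metric.
        total = dataset.count()
        correct = dataset.filter(
            F.col(self.predictionCol) == F.col(self.labelCol)).count()
        return correct / float(total)

    def isLargerBetter(self):
        # Tells CrossValidator/TrainValidationSplit whether to maximize or minimize.
        return True

evaluator = CustomEvaluator()
# The instance can then be passed wherever an evaluator is expected,
# e.g. CrossValidator(..., evaluator=evaluator)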

Related

Pyspark ML CrossValidator evaluate several evaluators

In sklearn's GridSearchCV we can give the model several scoring metrics, and with the refit param we refit on the whole dataset using the best parameters found for one of them.
Is there any way to do something similar with CrossValidator from pyspark's ML package?
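CrossValidator takes a single evaluator, so one rough workaround, sketched here under the assumption of an existing estimator lr, param grid grid, and train/test DataFrames, is to let CrossValidator select with one metric and then score the refit bestModel with the remaining metrics:
from pyspark.ml.tuning import CrossValidator
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Assumed to exist already: an estimator `lr`, a param grid `grid`,
# and DataFrames `train` and `test` with label/features columns.
primary = MulticlassClassificationEvaluator(metricName="f1")
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid,
                    evaluator=primary, numFolds=3)
cv_model = cv.fit(train)          # model selection is driven by f1 only

# Score the single refit bestModel with any additional metrics.
predictions = cv_model.bestModel.transform(test)
for metric in ["accuracy", "weightedPrecision", "weightedRecall"]:
    extra = MulticlassClassificationEvaluator(metricName=metric)
    print(metric, extra.evaluate(predictions))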

How do you implement a model built using sklearn pipeline in pyspark?

I would like to use a model I built with an sklearn pipeline from within PySpark. The pipeline takes care of imputation, scaling, one-hot encoding and random forest classification. I tried broadcasting the model and using a pandas UDF to predict, but it did not work; I got a Py4JJavaError.
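For reference, a minimal sketch of the broadcast-plus-pandas-UDF pattern the question describes (the spark session, the fitted sk_pipeline, and the feature columns f1/f2/f3 are assumptions; this does not diagnose the Py4JJavaError itself):
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# Assumed: an active `spark` session, a fitted sklearn Pipeline `sk_pipeline`,
# and a Spark DataFrame `df` with feature columns f1, f2, f3.
bc_model = spark.sparkContext.broadcast(sk_pipeline)

@F.pandas_udf(DoubleType())
def predict_udf(f1: pd.Series, f2: pd.Series, f3: pd.Series) -> pd.Series:
    # Reassemble the feature frame on the executor and call the sklearn model.
    X = pd.DataFrame({"f1": f1, "f2": f2, "f3": f3})
    return pd.Series(bc_model.value.predict(X))

scored = df.withColumn("prediction", predict_udf("f1", "f2", "f3"))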

How can I use Hyperopt with MLFlow within a pandas_udf?

I'm building multiple Prophet models, where the data for each model is passed to a pandas_udf function that trains the model and stores the results with MLflow.
@pandas_udf(result_schema, PandasUDFType.GROUPED_MAP)
def forecast(data):
    ...
    with mlflow.start_run() as run:
        ...
Then I call this UDF, which trains a model for each KPI:
df.groupBy('KPI').apply(forecast)
The idea is that, for each KPI, a model will be trained with multiple hyperparameters and the best params for each model will be stored in MLflow. I would like to use Hyperopt to make the search more efficient.
In this case, where should I place the objective function? Since the data is passed to the UDF for each model, I thought of creating an inner function within the UDF that uses the data for each run. Does this make sense?
If I remember correctly, you can't do it that way because it would amount to nested Spark execution, which Spark doesn't support. You'll need to change the approach to something like:
for kpi in list_of_kpis:
    run_hyperopt_tuning
if you need to tune parameters for every KPI model separately, because it will optimize the parameters of each model independently.
If KPI is more like a hyperparameter of the model, then you can just include the list of KPIs in the search space and load the necessary data inside the function that does the training and evaluation.
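A rough sketch of that second suggestion, treating the KPI as part of the Hyperopt search space (load_kpi_data, train_prophet and evaluate_forecast are hypothetical helpers, and the KPI names and search ranges are assumptions):
from hyperopt import fmin, tpe, hp, STATUS_OK

search_space = {
    # The KPI is treated as just another hyperparameter.
    "kpi": hp.choice("kpi", ["revenue", "orders", "traffic"]),
    "changepoint_prior_scale": hp.loguniform("changepoint_prior_scale", -5, 0),
}

def objective(params):
    # load_kpi_data is a hypothetical helper returning the series for one KPI.
    data = load_kpi_data(params["kpi"])
    model = train_prophet(data, params)        # hypothetical training helper
    loss = evaluate_forecast(model, data)      # hypothetical metric, lower is better
    return {"loss": loss, "status": STATUS_OK}

best = fmin(objective, search_space, algo=tpe.suggest, max_evals=50)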

MLeap and Spark ML SQLTransformer

I am trying to serialize a PySpark ML model to MLeap.
However, the model makes use of the SQLTransformer to do some column-based transformations, e.g. adding log-scaled versions of some columns.
As we all know, MLeap doesn't support SQLTransformer - see here:
https://github.com/combust/mleap/issues/126
so I've implemented the first of the two suggestions there:
For non-row operations, move the SQL out of the ML Pipeline that you plan to serialize.
For row-based operations, use the available ML transformers or write a custom transformer <- this is where the custom transformer documentation will help.
I've externalized the SQL transformation on the training data used to build the model, and I do the same for the input data when I run the model for evaluation.
The problem I'm having is that I'm unable to obtain the same results across the two models.
Model 1 - Pure Spark ML model containing the SQLTransformer plus the later transformations: StringIndexer -> OneHotEncoderEstimator -> VectorAssembler -> RandomForestClassifier
Model 2 - Externalized version, with the SQL queries run on the training data before building the model. The transformations are everything after the SQLTransformer in Model 1: StringIndexer -> OneHotEncoderEstimator -> VectorAssembler -> RandomForestClassifier
I'm wondering how I could go about debugging this problem. Is there a way to somehow compare the results after each stage to see where the differences show up?
Any suggestions are appreciated.
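One way to attempt that comparison, assuming both fitted models are PipelineModel instances and a small shared sample DataFrame is available, is to walk model.stages and inspect the intermediate output after each transformer (model and sample_df are assumptions):
def stage_outputs(pipeline_model, df):
    # Apply the fitted pipeline one stage at a time, printing a small sample
    # after each transformer so the two models can be compared stage by stage.
    current = df
    for stage in pipeline_model.stages:
        current = stage.transform(current)
        print(type(stage).__name__)
        current.show(5, truncate=False)
    return current

# Run once per model; for the externalized model, apply the equivalent SQL
# to sample_df first, then compare the printed intermediate outputs.
stage_outputs(model, sample_df)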

What is the difference between "sklearn.cluster.k_means" and "sklearn.cluster.KMeans", and when should I use each of them?

I am confused about the difference between "sklearn.cluster.k_means" and "sklearn.cluster.KMeans". When should I use each of them?
From the sklearn glossary: "[w]e provide ad hoc function interfaces for many algorithms, while estimator classes provide a more consistent interface." k_means() is just a functional wrapper: it fits a KMeans instance internally and returns the values of its fitted attributes:
cluster_centers_,
labels_,
inertia_,
n_iter_
KMeans is a class designed following the developer guide for sklearn objects. KMeans, like other estimator objects in sklearn, must implement methods for:
fit(),
transform(), and
score(),
and can also implement other methods like predict(). The main benefit of using KMeans over k_means() is that you have easy access to these other methods. For example, if you want to use your trained model to predict which cluster unseen data belongs to:
from sklearn.cluster import KMeans
est = KMeans()
est.fit(X_train)
cluster_labels = est.predict(X_test)
If you use the functional API, you would have to look under the hood of KMeans.predict() to figure out how to apply the model to unseen data.
The functional design is not implemented for all sklearn objects, but you can easily implement this yourself using other examples from sklearn to guide you.
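For comparison, a rough sketch of the functional interface on toy data (the random data here is just an assumption for illustration):
import numpy as np
from sklearn.cluster import k_means

X = np.random.rand(100, 2)  # toy data, for illustration only
centers, labels, inertia = k_means(X, n_clusters=3, random_state=0)
# The fitted results come back directly as a tuple, but there is no estimator
# object to call predict() on later; that is the trade-off described above.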
