How to use functions from sklearn in pyspark - apache-spark

I have a training set with 201,917 rows, 3 features and 1 target. My aim is to calculate the strength of the relationship between each individual feature and the target. My method of choice is sklearn.feature_selection.mutual_info_regression because it works for continuous variables and can detect non-linear relationships better than its counterpart, sklearn.feature_selection.f_regression. This is the line I tried to run:
feature_selection.mutual_info_regression(trainPD[['feature_1']],trainPD['target'])
The problem is that if I run sklearn.feature_selection.mutual_info_regression in Colab, the system crashes. Hence my idea was to shift to pyspark. But pyspark.ml does not have support for sklearn.feature_selection.mutual_info_regression. So what are my options for using sklearn.feature_selection.mutual_info_regression in pyspark?
I am not sure if pandas_udf will help, because this is not the traditional pd.Series -> pd.Series conversion for which pyspark parallelization works.
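One option that might work, assuming Spark 3.x: unpivot the three feature columns into long format and let groupBy(...).applyInPandas() run sklearn's mutual_info_regression once per feature on a worker. The names below (trainDF, feature_1 ... feature_3, target) are assumptions based on the question, and each group (one feature column plus the target, about 200k rows) must fit in a single executor's memory. A minimal sketch:
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType
from sklearn.feature_selection import mutual_info_regression
import pandas as pd

features = ["feature_1", "feature_2", "feature_3"]  # assumed column names

# Unpivot to long format: one row per (feature name, feature value, target)
long_df = trainDF.select(
    F.explode(F.array(*[
        F.struct(F.lit(c).alias("name"), F.col(c).alias("value"), F.col("target").alias("target"))
        for c in features
    ])).alias("x")
).select("x.*")

schema = StructType([StructField("name", StringType()),
                     StructField("mi", DoubleType())])

def mi_per_feature(pdf: pd.DataFrame) -> pd.DataFrame:
    # Runs on a worker with one feature's values and the target as a pandas DataFrame
    mi = mutual_info_regression(pdf[["value"]], pdf["target"])[0]
    return pd.DataFrame({"name": [pdf["name"].iloc[0]], "mi": [float(mi)]})

long_df.groupBy("name").applyInPandas(mi_per_feature, schema=schema).show()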

Related

Same sklearn pipeline different results

I have created a pipeline based on:
A custom TfidfVectorizer that returns the TF-IDF vectors as a dataframe (600 features)
A custom feature generator that creates new features (5)
A FeatureUnion to join the two dataframes; I checked that the output is an array, so there are no feature names (605 features total)
An XGBoost classifier with seed and random_state set (8 classes as label names)
If I fit and use the pipeline in a Jupyter notebook, I obtain good F1 scores.
However, when I save it (using pickle, joblib or dill) and later load it in another notebook or script, I cannot always reproduce the results! I cannot understand it, because the input for testing is always the same, and so is the Python environment.
Could you help me with some suggestions?
Thanks!
Tried to save the pipeline with different libraries.
Tried a DenseTransformer at some points.
Tried a ColumnTransformer instead of the FeatureUnion.
I cannot use the pmml library due to some restrictions.
Etc.
The problem is always the same.
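For reference, a minimal sketch of the kind of pipeline described in this question, with a hypothetical stand-in for the custom transformer; the point is that the seeds are fixed and the saved artifact is the whole fitted pipeline, so the loaded model should reproduce the in-memory predictions:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from xgboost import XGBClassifier
import joblib

# CustomFeatureGenerator is a hypothetical stand-in for the custom transformer in the question
pipeline = Pipeline([
    ("features", FeatureUnion([
        ("tfidf", TfidfVectorizer(max_features=600)),
        ("extra", CustomFeatureGenerator()),
    ])),
    ("clf", XGBClassifier(random_state=42)),
])
pipeline.fit(X_train, y_train)            # X_train: raw text, y_train: the 8 class labels
joblib.dump(pipeline, "pipeline.joblib")

# In another notebook/script, the loaded pipeline should give the same predictions
loaded = joblib.load("pipeline.joblib")
assert (loaded.predict(X_test) == pipeline.predict(X_test)).all()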

Parallelization of sklearn functions using MPI without cross-validation

I have a group of time series to which I want to apply a LASSO regression using sklearn. As the data is pretty sparse, I need the whole length of each time series, so I can't cross-validate. The datasets are big and the training process is time consuming, so I have to run it on a cluster.
In order to use different nodes I use MPI. As far as I know, it is possible to run sklearn functions on a cluster using MPI, but this approach basically works by distributing cross-validation chunks, as in the following project:
https://github.com/sebp/scikit-learn-mpi-grid-search
I was wondering if there is any other way to use MPI to parallelize the training process in sklearn without cross-validation? I think it would mean that the underlying algorithm of the sklearn function would have to be parallelized itself.
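If the LASSO fits themselves stay serial but each time series is an independent problem, one hedged sketch is to spread the series over MPI ranks with mpi4py and fit them in parallel; series_list and the alpha value below are hypothetical:
from mpi4py import MPI
from sklearn.linear_model import Lasso

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# series_list is a hypothetical list of (X, y) pairs, one per time series
my_series = series_list[rank::size]                  # round-robin split across MPI ranks
my_models = [Lasso(alpha=0.1).fit(X, y) for X, y in my_series]

all_models = comm.gather(my_models, root=0)          # collect the fitted models on rank 0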

Calling function on each row of a DataFrame that requires building a DataFrame

I'm trying to wrap some functionality of the lime python library around Spark ML models. The general idea is to take a PipelineModel (containing each phase of data transformation and the application of the model) as input and build a function that calls the Spark model, applies the lime algorithm and gives an explanation for each single row.
Some context
The lime algorithm consists of locally approximating a trained machine learning model. In its implementation, lime basically just needs a function that, given a feature vector as input, evaluates the model's predictions. With this function, lime can slightly perturb the feature input, see how the model predictions change and then give an explanation. So, theoretically, it can be applied to any model, evaluated with any engine.
The idea here is to use it with Spark ml models.
The wrapping
In particular, I'm wrapping the LimeTabularExplainer. In order to work, it needs a feature vector in which each element is an index corresponding to the category. Using StringIndexer and similar transformers, it's pretty easy to build such a vector from the "raw" values of the data. Then, I built a function that, from such a vector (or a 2d array if you have more than one case), creates a Spark DataFrame, applies the PipelineModel and returns the model predictions.
The task
Ideally, I would like to build a function that does the following:
process a row of an input DataFrame
from the row, build and collect a numpy vector that works as input for the lime explainer
internally, the lime explainer slightly changes that vector in many ways, building a 2d array of "similar" cases
the above cases are transformed back into a Spark DataFrame
the PipelineModel is applied to the above DataFrame, and the results are collected and brought back to the lime explainer, which then continues its work
The problem
As you can see (if you have read this far!), for each row of the DataFrame you build another DataFrame. So, you cannot define a udf, since you are not allowed to call Spark functions inside a udf.
So the question is: how can I parallelize the above procedure? Is there another approach that I could follow to avoid the problem?
I think you can still use udfs in this case, followed by explode() to retrieve all the results on different rows. You just have to make sure the input column is already the vector you want to feed to lime.
That way you don't even have to collect data out of Spark, which is expensive. Maybe you can even use vectorized udfs in your case to gain speed (not sure).
from pyspark.sql.functions import udf, explode
from pyspark.sql.types import ArrayType, DoubleType

def function(base_case):
    # limefunction is assumed to return a list of perturbed "similar" cases
    return limefunction(base_case)

# ArrayType needs an element type; DoubleType is assumed here
f_udf = udf(function, ArrayType(DoubleType()))
df_result = df_start.withColumn("similar_cases", explode(f_udf("base_case")))
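If you do try the vectorized route, a rough sketch with a pandas_udf (assuming Spark 3.x; limefunction is the same hypothetical function that returns a list of perturbed cases per input vector):
import pandas as pd
from pyspark.sql.functions import pandas_udf, explode
from pyspark.sql.types import ArrayType, DoubleType

@pandas_udf(ArrayType(DoubleType()))
def similar_cases(base_cases: pd.Series) -> pd.Series:
    # Each batch of input vectors arrives as a pandas Series; apply limefunction per element
    return base_cases.apply(limefunction)

df_result = df_start.withColumn("similar_cases", explode(similar_cases("base_case")))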

PySpark with scikit-learn

I have seen in various places that we can use scikit-learn libraries with pyspark to work on a partition on a single worker.
But what if we want to work on a training dataset that is distributed, and say the regression algorithm should operate on the entire dataset? Since scikit-learn is not integrated with RDDs, I assume it doesn't allow running the algorithm on the entire dataset but only on a particular partition. Please correct me if I'm wrong.
And how good is spark-sklearn at solving this problem?
As described in the documentation, spark-sklearn does address your requirements; it can:
train and evaluate multiple scikit-learn models in parallel. It is a distributed analog to the multicore implementation included by default in scikit-learn.
convert Spark's Dataframes seamlessly into numpy ndarrays or sparse matrices.
So, to specifically answer your question:
But what if we want to work on a training dataset that is distributed, and say the regression algorithm should operate on the entire dataset? Since scikit-learn is not integrated with RDDs, I assume it doesn't allow running the algorithm on the entire dataset but only on a particular partition.
In spark-sklearn, Spark is used as the replacement for the joblib library as a multithreading framework. So, going from execution on a single machine to execution on multiple machines is seamlessly handled by Spark for you. In other terms, as stated in the Auto scaling scikit-learn with spark article:
no change is required in the code between the single-machine case and the cluster case.
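As a rough sketch of what that looks like in practice (assuming the spark-sklearn package is installed, sc is an existing SparkContext, and the estimator and parameter grid are just placeholders):
from sklearn.svm import SVC
from spark_sklearn import GridSearchCV    # distributed analog of sklearn's GridSearchCV

param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1]}
gs = GridSearchCV(sc, SVC(), param_grid)  # each parameter combination is fit on a Spark worker
gs.fit(X, y)                              # X, y live on the driver as numpy arrays
print(gs.best_params_)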

how to do multiple target linear regression in Spark MLLib?

Spark ML LinearRegression seems to regress against a single label.
LabeledPoint(label: Double, features: Array[Double])
https://spark.apache.org/docs/0.8.1/api/mllib/org/apache/spark/mllib/regression/LabeledPoint.html
However, with my problem, I need to predict a vector
e.g.
LabeledPoint(label: Array[Double], features: Array[Double])
Is there a way for me to do this? (This is supported in scikit-learn and I am trying to do it in Spark.)
PS 1: If this is not possible in MLlib directly, is there a tutorial on how to implement this from scratch using Spark?
PS 2: My output label is a 60-element vector, so I could run LinearRegression 60 times and then run 60 predictions. But that seems like a hack.
There is no native implementation as far as I know, but if you look at the scikit-learn implementation for multioutput regression, it says that the "strategy consists of fitting one regressor per target. Since each target is represented by exactly one regressor it is possible to gain knowledge about the target by inspecting its corresponding regressor".
This means that a potential implementation could be to run the regression step separately for each target. You could then distribute those calculations in parallel to speed things up.
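A rough sketch of that per-target strategy with pyspark.ml, assuming the DataFrame already has an assembled "features" vector column and 60 target columns named y_0 ... y_59 (the names are hypothetical):
from pyspark.ml.regression import LinearRegression

# Fit one independent LinearRegression per target column
models = {}
for i in range(60):
    lr = LinearRegression(featuresCol="features", labelCol=f"y_{i}",
                          predictionCol=f"pred_y_{i}")
    models[f"y_{i}"] = lr.fit(train_df)

# Apply all 60 fitted models; each adds its own prediction column
scored = test_df
for model in models.values():
    scored = model.transform(scored)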
