Efficient implementation of SOM (Self organizing map) on Pyspark - apache-spark

I am struggling with the implementation of a performant version of a SOM Batch algorithm on Spark / Pyspark for a huge dataset with > 100 features.
I have the feeling that I can either use RDDs where I can/have to specifiy the parallization on my own or I use Dataframe which should be more performant but I see no way how to use something like a local accumulation variable for each worker when using dataframes.
Ideas:
Using Accumulators. Parallelize the calculations by creating a UDF which takes the observations as input, calculates the impacts on the net and sends the impacts to an accumulator in the driver. (Implemented this version already, but seems rather slow (I think accumulator updates take to long))
Store results in a new column of Dataframe and then sum it together in the end. (Would have to store a whole neural net in the each row (e.g. 20*20*130) tho) Are spark optimization algorithms realizing, that it does not need to save each net but only sum them together?
Create an custom parallized algorithms using RDDs similar to that: https://machinelearningnepal.com/2018/01/22/apache-spark-implementation-of-som-batch-algorithm/ (but with more performant calculation algorithms). But I would have to use some kind of loop to loop over each row and update the net -> sounds like that would be rather unperformant.)
Any thoughts on the different options? Is there an even better option?
Or are all ideas not that good and I should just preselect a maximum variety subset of my dataset and train a SOM locally on that.
Thanks!

This is exactly what I have done last year, so I might be in a good position to give you an answer.
First, here is my Spark implementation of the batch SOM algorithm (it is written in Scala, but most things will be similar in Pyspark).
I needed this algorithm for a project, and every implementation I found had at least one of these two problems or limitations:
they did not really implement the batch SOM algorithm, but used a map averaging method that gave me strange results (abnormal symmetries in the output map)
they did not use the DataFrame API (pure RDD API) and were not in the Spark ML/MLlib spirit, i.e. with a simple fit()/transform() API operating over DataFrames.
So, there I went on to code it myself: the batch SOM algorithm in Spark ML style. The first thing I did was looking how k-means was implemented in Spark ML, because as you know, the batch SOM is very similar to the k-means algorithm. Actually, I could re-use a large portion of the Spark ML k-means code, but I had to modify the core algorithm and the hyperparameters.
I can summarize quickly how the model is built:
A SOMParams class, containing the SOM hyperparameters (size, training parameters, etc.)
A SOM class, which inherits from spark's Estimator, and contains the training algorithm. In particular, it contains a fit() method that operates on an input DataFrame, where features are stored as a spark.ml.linalg.Vector in a single column. fit() will then select this column and unpack the DataFrame to obtain the unerlying RDD[Vector] of features, and call the run() method on it. This is where all the computations happen, and as you guessed, it uses RDDs, accumulators and broadcast variables. Finally, the fit() method returns a SOMModel object.
SOMModel is a trained SOM model, and inherits from spark's Transformer/Model. It contains the map prototypes (center vectors), and contains a transform() method that can operate on DataFrames by taking an input feature column, and adding a new column with the predictions (projection on the map). This is done by a prediction UDF.
There is also SOMTrainingSummary that collects stuff such as the objective function.
Here are the take-aways:
There is not really an opposition between RDD and DataFrames (or rather Datasets, but the difference between those two is of no real importance here). They are just used in different contexts. In fact, a DataFrame can be seen as a RDD specialized for manipulating structured data organized in columns (such as relational tables), allowing SQL-like operations and an optimization of the execution plan (Catalyst optimizer).
For structured data, select/filter/aggregation operations, DO USE Dataframes, always.
...but for more complex tasks such as a machine learning algorithm, you NEED to come back to the RDD API and distribute your computations yourself, using map/mapPartitions/foreach/reduce/reduceByKey/and so son. Look at how things are done in MLlib: it's only a nice wrapper around RDD manipulations!
Hope it will solve your question. Concerning performance, as you asked for an efficient implementation, I did not make any benchmarks yet but I use it at work and it crunches 500k/1M-rows datasets in a couple of minutes on the production cluster.

Related

How to better structure linear algebra-heavy code in PySpark?

Need some suggestions on scaling a pipeline on spark that performs Collaborative Filtering for about 200k-1m people, but does so in groups, with the largest group being approx. 40-50k customers at best. In addition to Collaborative Filtering, which is reasonably fast with ALS, there's a lot of linear algebra that occurs that I couldn't really figure out how to perform with the spark Dataframe API, and had to drop down to the RDD API to perform, and that leads to a significant loss in performance. I've currently got multiple variations of this script - in scala, pyspark, and python - and by far the fastest, despite not being distributor/parallelized, is python, where I'm using numpy for all linear algebra tasks, and python for the remaining transformations.
So, to summarize, I've got a pipeline with a lot of complicated linear algebra that spark doesn't seem to have performant native data structures for, and the workarounds I've devised - RDDs level manipulations for most operations, parallelizing and broadcasting the RDDs to perform matmul in chunks, etc - are significantly slower than just performing the operations in-memory on numpy.
I've got a couple of ideas on how to scale this, but they are a bit hacky, so I was hoping that somebody more experienced could pitch in.
Keep the entire script in python. Used Dask to distribute the processing of various groups of customers in parallel across the cluster.
Keep the entire script in python, but run that using pyspark, keeping a pandas UDF as an entry/exit point for various python functions. However, since pandas UDF have certain limitations in that I can only input & output a single dataframe, but my analysis requires multiple datasets, I need to have some workarounds. Here's what I've what figured out:
Read all datasets into pyspark. All relevant datasets have same number of rows, indexed with customer and other attributes, so I'll concat each row of a dataset into a single column array. So, basically, the 3-4 datasets become 3-4 columns in a consolidated dataset + a customer index.
Transfer this across to python via a pandas UDF.
Extract all relevant datasets from this combined structure in python, perform all the operations (around 1000 loc) and resemble the outputs into a similar structure as the input and transfer back to pyspark.
Since I used a pandas UDF computations across all groups should have occurred in parallel. This then becomes akin to running a Dask like distributed compute via pyspark.
Extract all the data from this consolidated array, map types, and save via pyspark.
This is extremely hacky, and has a few downsides, but I think it'll do the job. I realize that I won't really be able to debug the python udf code easily, so that'll be an irritant, and the solution is still fundamentally limited by the size of the largest single executor I can get, but despite that it'll likely perform better native pyspark/scala code.
Any suggestions on how to better structure this, or ideas about how to do more rapid linear algebra on pyspark natively would be greatly appreciated.

Calling function on each row of a DataFrame that requires building a DataFrame

I'm trying to wrap some functionalities of the lime python library over spark ml models. The general idea is to have a PipelineModel (containg each phase of data transformation and the application of the model) as an input and build a functionality the calls the spark model, apply the lime algorithm and give an explanation for each single row.
Some context
The lime algorithm consists in approximating locally a trained machine learning model. In its implementation, lime just basically needs a function that, given a feature vector as input, evaluates the predictions of the model. With this function, lime can perturb slightly the feature input, see how the model predictions change and then give an explanation. So, theoretically, it can be applied to any model, evaluated with any engine.
The idea here is to use it with Spark ml models.
The wrapping
In particular, I'm wrapping the LimeTabularExplainer. In order to work, it needs a feature vector in which each element is an index corresponding to the category. Digging with the StringIndexer and similar, it's pretty easy to build such vector from the "raw" values of the data. Then, I built a function that, from such vector (or a 2d array if you have more than one case), create a Spark DataFrame, apply the PipelineModel and returns the model predictions.
The task
Ideally, I would a like to build a function that does the following:
process a row of an input DataFrame
from the row, it builds and collect a numpy vector that works as input for the lime explainer
internally, the lime explainer slightly changes that vector in many ways, building a 2d array of "similar" cases
the above cases are transformed back as a Spark DataFrame
the PipelineModel is applied on the above DataFrame, the results collected and brought the lime explainer that will continue its work
The problem
As you see (if you read so far!), for each row of the DataFrame you build another DataFrame. So, you cannot define an udf, since you are not allowed to call Spark functions inside the udf.
So the question is: how can I parallelize the above procedure? Is there another approach that I could follow to avoid the problem?
I think you can still use udfs in this case, followed by explode() to retrieve all the results on different lines. You just have to make sure the input column is already the vector you want to feed lime.
That way you don't even have to collect out of spark, which is expensive. Maybe you can even use vectorized udfs in your case to gain speed(not sure)
def function(base_case):
list_similarCases = limefunction(base_case)
return list_similarCases
f_udf = udf(function, ArrayType())
df_result = df_start.withColumn("similar_cases", explode(f_udf("base_case")))

Is it inefficient to use a UDF to calculate the distance between two vectors?

I have implemented a classification algorithm in Spark that involves calculating distances between instances. The implementation uses dataframes (and raw SQL where possible). I transform the features of the instances into a vector so I can apply a Scaler and to end up with a uniform schema regardless of how many features my dataset happens to have.
As far as I understand, Spark SQL can't do calculations with vector columns. So in order to calculate the distance between instances, I've had to define a python function and register it as a UDF. But I see warnings against using UDFs because the dataframe engine "can't optimise UDFs".
My questions are:
Is it correct that there is no way to calculate the distance between two feature vectors within SQL (not using a UDF)?
Can the use of a UDF to calculate the distance between vectors have a large impact on performance, or is there nothing for Spark to optimise here anyway?
Is there some other consideration I've missed?
To be clear, I'm hoping the answer is either
"You're doing it wrong, this is indeed inefficient, here's how to do it instead: ...", or
"UDFs are not intrinsically inefficient, this is a perfectly good use for them and there's no opimisation you're missing out on"
UDF are not efficient nor optimized, and are not transferred to jvm code especially if you use PySpark, there is pickle object created, OS spent lots of resources to transfer from jvm in/out. I have implemented something in pyspark using udf for geolocation and it would never finish in a few days on the other hand implemented in scala it has finished in a few hours.
Do it in scala if you have to do it.
Maybe that can help
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/CosineSimilarity.scala

Using RDD transformation and converts it to a Dataset before an action VS using Dataset and its API

Consider the two scenarios:
A) If I have a RDD and various RDD transformations are called on it, and before any actions are done I create a Dataset from it.
B) I create a Dataset at the very beginning and calls various Dataset methods on it.
Question: If the two scenarios produce the same outcome logically - one uses RDD transformation and converts it to a Dataset right before an action vs just using Dataset and its transformation - do both scenarios goes through the same optimizations?
No they do not.
When you do RDD and RDD transformation on them, no optimization is done. When you transform it to dataset in the end, then and only then conversion to tungsten based representation (which takes less memory and doesn't need to go through garbage collection) is performed.
When you use dataset from the beginning then it will use the tungsten based memory representation from the beginning. This means it will take less memory, shuffles will be smaller and faster and no GC overhead would occur (although conversion from internal representation to case class and back would occur any time typed operations are used). If you use dataframe operations on the dataset then it may also take advantage of code gen and catalyst optimizations.
See also my answer in: Do I have to explicitly use Dataframe's methods to take advantage of Dataset's optimization?
They don't. RDD API doesn't use any of the Tungsten / Catalyst optimizations and equivalent logic is not relevant.

Get Spark metrics on each iteration step?

Applying spark's logistic regression on a specific dataset requires to define a number of iterations. So far I've learned that outputting the result of the cost function on each iteration might be useful information to plot. It can be used to visualize how many iterations a function needs to converge to a minimum. I was wondering if there is a way to output such information in spark? Looping over a train() function with different iteration numbers, sounds like a solution that requires a lot of time on large datasets. It would be nice to know if there is a better one already built in. Thanks for any advice on this topic.
After you've trained a model (call it myModel) that has such a history, you can get the iteration-by-iteration history with
myModel.summary.objectiveHistory.foreach(...)
There's a nice example here in the Spark ML documentation -- once you know the right search terms.

Resources