I need some suggestions on scaling a Spark pipeline that performs Collaborative Filtering for about 200k-1m people, but does so in groups, with the largest group being approx. 40-50k customers at most. In addition to Collaborative Filtering, which is reasonably fast with ALS, there's a lot of linear algebra that I couldn't figure out how to perform with the Spark DataFrame API, so I had to drop down to the RDD API, and that leads to a significant loss in performance. I've currently got multiple variations of this script - in Scala, PySpark, and Python - and by far the fastest, despite not being distributed/parallelized, is Python, where I'm using numpy for all linear algebra tasks and plain Python for the remaining transformations.
So, to summarize, I've got a pipeline with a lot of complicated linear algebra that Spark doesn't seem to have performant native data structures for, and the workarounds I've devised - RDD-level manipulations for most operations, parallelizing and broadcasting the RDDs to perform matmul in chunks, etc. - are significantly slower than just performing the operations in memory with numpy.
I've got a couple of ideas on how to scale this, but they are a bit hacky, so I was hoping that somebody more experienced could pitch in.
Keep the entire script in python. Use Dask to distribute the processing of various groups of customers in parallel across the cluster.
Keep the entire script in python, but run it using pyspark, keeping a pandas UDF as an entry/exit point for the various python functions. However, pandas UDFs are limited in that I can only input and output a single dataframe, while my analysis requires multiple datasets, so I need some workarounds. Here's what I've figured out:
Read all datasets into pyspark. All relevant datasets have same number of rows, indexed with customer and other attributes, so I'll concat each row of a dataset into a single column array. So, basically, the 3-4 datasets become 3-4 columns in a consolidated dataset + a customer index.
Transfer this across to python via a pandas UDF.
Extract all relevant datasets from this combined structure in python, perform all the operations (around 1000 loc), reassemble the outputs into a structure similar to the input, and transfer it back to pyspark.
Since I used a pandas UDF, computations across all groups should have occurred in parallel. This then becomes akin to running a Dask-like distributed compute via pyspark.
Extract all the data from this consolidated array, map types, and save via pyspark.
This is extremely hacky, and has a few downsides, but I think it'll do the job. I realize that I won't really be able to debug the python UDF code easily, so that'll be an irritant, and the solution is still fundamentally limited by the size of the largest single executor I can get, but despite that it'll likely perform better than native pyspark/scala code.
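For what it's worth, the pack/unpack steps described above could be sketched roughly as follows. This is only a minimal local sketch using pandas and numpy: the names pack and process_group, the "ratings" and "weights" columns, and the similarity calculation are all hypothetical placeholders for the real datasets and the real 1000 lines of linear algebra. On Spark, a function of this shape is what would be handed to groupBy(...).applyInPandas(...); here the groups are simply looped over locally.

```python
import numpy as np
import pandas as pd

def pack(customers, groups, **matrices):
    """Combine several row-aligned matrices into one frame, one array column each."""
    frame = pd.DataFrame({"customer": customers, "group": groups})
    for name, matrix in matrices.items():
        # Each cell holds the full row of that dataset as a list of floats.
        frame[name] = np.asarray(matrix, dtype=float).tolist()
    return frame

def process_group(pdf: pd.DataFrame) -> pd.DataFrame:
    """Receives one group as a pandas DataFrame and returns one DataFrame."""
    # Unpack the array columns back into dense numpy matrices.
    ratings = np.stack(pdf["ratings"].to_numpy())   # (n_customers, n_items)
    weights = np.stack(pdf["weights"].to_numpy())   # (n_customers, n_items)
    # Hypothetical stand-in for the real linear algebra.
    weighted = ratings * weights
    similarity = weighted @ weighted.T
    return pd.DataFrame({
        "customer": pdf["customer"].to_numpy(),
        "group": pdf["group"].to_numpy(),
        "score": similarity.sum(axis=1),
    })

# Toy data: six customers in two groups, two row-aligned datasets.
rng = np.random.default_rng(0)
packed = pack(
    np.arange(6),
    np.array(["a", "a", "a", "b", "b", "b"]),
    ratings=rng.random((6, 4)),
    weights=rng.random((6, 4)),
)

# Local stand-in for: spark_df.groupBy("group").applyInPandas(process_group, schema=...)
result = pd.concat([process_group(g) for _, g in packed.groupby("group")])
```

One side benefit of keeping process_group a plain pandas-in, pandas-out function is that it can be debugged locally on a sample of one group before it ever runs on an executor.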
Any suggestions on how to better structure this, or ideas about how to do more rapid linear algebra on pyspark natively would be greatly appreciated.
Related
We have a use case for doing a large number of vector multiplications and summing the results such that the input data typically will not fit into the RAM of a single host, even if using 0.5 TB RAM EC2 instances (fitting OLS regression models). Therefore we would like to:
Leverage PySpark for Spark's traditional capabilities (distributing the data, handling worker failures transparently, etc.)
But also leverage C/C++-based numerical computing for doing the actual math on the workers
The leading path seems to be to leverage Apache Arrow with PySpark, and use Pandas functions backed by NumPy (in turn written in C) for the vector products. However, I would like to load the data directly to Arrow format on Spark workers. The existing PySpark/Pandas/Arrow documentation seems to imply that the data is in fact loaded into Spark's internal representation first, then converted into Arrow when Pandas UDFs are called: https://spark.apache.org/docs/3.0.1/sql-pyspark-pandas-with-arrow.html#setting-arrow-batch-size
I found one related paper in which the authors developed a zero-copy Arrow-based interface for Spark, so I take it that this is a highly custom thing to do that is not currently supported in Spark: https://users.soe.ucsc.edu/~carlosm/dev/publication/rodriguez-arxiv-21/rodriguez-arxiv-21.pdf
I would like to ask if anyone knows of a simple way other than what is described in this paper. Thank you!
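As an aside on the "vector multiplications and summing the results" part: for OLS the quantities being summed can be kept very small per worker. The sketch below is plain numpy with the partitions simulated as a list of in-memory chunks; the function names (partial_stats, combine, solve_ols) are made up for illustration, and on Spark the per-chunk step is assumed to run inside something like mapPartitions or a pandas function, followed by a reduce. It does not address the zero-copy Arrow loading question itself.

```python
from functools import reduce

import numpy as np

def partial_stats(X, y):
    """Per-chunk sufficient statistics for OLS: X^T X (d x d) and X^T y (d)."""
    return X.T @ X, X.T @ y

def combine(a, b):
    """Sum two pairs of statistics; this is associative, so it suits a reduce."""
    return a[0] + b[0], a[1] + b[1]

def solve_ols(xtx, xty):
    """Solve the normal equations (X^T X) beta = X^T y."""
    return np.linalg.solve(xtx, xty)

# Simulated partitions: four chunks that are never concatenated by the solver.
rng = np.random.default_rng(42)
true_beta = np.array([2.0, -1.0, 0.5])
chunks = []
for _ in range(4):
    X = rng.normal(size=(1000, 3))
    y = X @ true_beta + rng.normal(scale=0.01, size=1000)
    chunks.append((X, y))

xtx, xty = reduce(combine, (partial_stats(X, y) for X, y in chunks))
beta = solve_ols(xtx, xty)
```

Only a d x d matrix and a length-d vector leave each chunk, so the full data never has to sit in one host's RAM. The normal equations can be numerically fragile when features are nearly collinear, so treat this as a sketch rather than a recommendation.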
I have implemented a classification algorithm in Spark that involves calculating distances between instances. The implementation uses dataframes (and raw SQL where possible). I transform the features of the instances into a vector so I can apply a Scaler and to end up with a uniform schema regardless of how many features my dataset happens to have.
As far as I understand, Spark SQL can't do calculations with vector columns. So in order to calculate the distance between instances, I've had to define a python function and register it as a UDF. But I see warnings against using UDFs because the dataframe engine "can't optimise UDFs".
My questions are:
Is it correct that there is no way to calculate the distance between two feature vectors within SQL (not using a UDF)?
Can the use of a UDF to calculate the distance between vectors have a large impact on performance, or is there nothing for Spark to optimise here anyway?
Is there some other consideration I've missed?
To be clear, I'm hoping the answer is either
"You're doing it wrong, this is indeed inefficient, here's how to do it instead: ...", or
"UDFs are not intrinsically inefficient, this is a perfectly good use for them and there's no optimisation you're missing out on"
UDFs are neither efficient nor optimised. They are not translated to JVM code, and with PySpark in particular the data is pickled and a lot of resources are spent moving it in and out of the JVM. I implemented a geolocation job in pyspark using a UDF and it had not finished after a few days; the same thing implemented in scala finished in a few hours.
Do it in scala if you have to do it.
Maybe that can help
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/CosineSimilarity.scala
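If you do stay in Python, the distance arithmetic itself can at least be done on whole batches of vectors rather than one pair per call. The following is a numpy-only sketch; the function names are hypothetical, and it assumes the feature vectors have already been unpacked into 2-D arrays (for example inside a batch-oriented pandas UDF), which is not shown here.

```python
import numpy as np

def pairwise_euclidean(a, b):
    """Distances between every row of a (n, d) and every row of b (m, d)."""
    # Uses ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y so the heavy work is one matmul.
    sq = (a * a).sum(axis=1)[:, None] + (b * b).sum(axis=1)[None, :] - 2.0 * (a @ b.T)
    # Rounding can leave tiny negative values; clip them before the square root.
    return np.sqrt(np.maximum(sq, 0.0))

def rowwise_euclidean(a, b):
    """Distance between a[i] and b[i] for two arrays of the same shape."""
    return np.linalg.norm(a - b, axis=1)

a = np.array([[0.0, 0.0], [3.0, 4.0]])
b = np.array([[0.0, 0.0], [6.0, 8.0]])
all_pairs = pairwise_euclidean(a, b)   # shape (2, 2)
matched = rowwise_euclidean(a, b)      # shape (2,)
```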
I am struggling with the implementation of a performant version of a SOM Batch algorithm on Spark / Pyspark for a huge dataset with > 100 features.
I have the feeling that I can either use RDDs, where I can/have to specify the parallelization on my own, or use DataFrames, which should be more performant, but I see no way to use something like a local accumulation variable for each worker when using DataFrames.
Ideas:
Using Accumulators. Parallelize the calculations by creating a UDF which takes the observations as input, calculates the impacts on the net and sends the impacts to an accumulator in the driver. (I implemented this version already, but it seems rather slow; I think accumulator updates take too long.)
Store results in a new column of the DataFrame and then sum them together at the end. (I would have to store a whole neural net in each row, e.g. 20*20*130, though.) Do Spark's optimizations realize that it does not need to save each net, but only sum them together?
Create a custom parallelized algorithm using RDDs, similar to this: https://machinelearningnepal.com/2018/01/22/apache-spark-implementation-of-som-batch-algorithm/ (but with more performant calculation algorithms). But I would have to use some kind of loop over each row to update the net, which sounds like it would perform rather poorly.
Any thoughts on the different options? Is there an even better option?
Or are all ideas not that good and I should just preselect a maximum variety subset of my dataset and train a SOM locally on that.
Thanks!
This is exactly what I have done last year, so I might be in a good position to give you an answer.
First, here is my Spark implementation of the batch SOM algorithm (it is written in Scala, but most things will be similar in Pyspark).
I needed this algorithm for a project, and every implementation I found had at least one of these two problems or limitations:
they did not really implement the batch SOM algorithm, but used a map averaging method that gave me strange results (abnormal symmetries in the output map)
they did not use the DataFrame API (pure RDD API) and were not in the Spark ML/MLlib spirit, i.e. with a simple fit()/transform() API operating over DataFrames.
So I went on to code it myself: the batch SOM algorithm in Spark ML style. The first thing I did was to look at how k-means was implemented in Spark ML, because as you know, the batch SOM is very similar to the k-means algorithm. In fact, I could re-use a large portion of the Spark ML k-means code, but I had to modify the core algorithm and the hyperparameters.
I can summarize quickly how the model is built:
A SOMParams class, containing the SOM hyperparameters (size, training parameters, etc.)
A SOM class, which inherits from spark's Estimator, and contains the training algorithm. In particular, it contains a fit() method that operates on an input DataFrame, where features are stored as a spark.ml.linalg.Vector in a single column. fit() will then select this column and unpack the DataFrame to obtain the underlying RDD[Vector] of features, and call the run() method on it. This is where all the computations happen, and as you guessed, it uses RDDs, accumulators and broadcast variables. Finally, the fit() method returns a SOMModel object.
SOMModel is a trained SOM model, and inherits from spark's Transformer/Model. It contains the map prototypes (center vectors), and contains a transform() method that can operate on DataFrames by taking an input feature column, and adding a new column with the predictions (projection on the map). This is done by a prediction UDF.
There is also SOMTrainingSummary that collects stuff such as the objective function.
Here are the take-aways:
There is not really an opposition between RDD and DataFrames (or rather Datasets, but the difference between those two is of no real importance here). They are just used in different contexts. In fact, a DataFrame can be seen as a RDD specialized for manipulating structured data organized in columns (such as relational tables), allowing SQL-like operations and an optimization of the execution plan (Catalyst optimizer).
For structured data, select/filter/aggregation operations, DO USE Dataframes, always.
...but for more complex tasks such as a machine learning algorithm, you NEED to come back to the RDD API and distribute your computations yourself, using map/mapPartitions/foreach/reduce/reduceByKey/and so on. Look at how things are done in MLlib: it's only a nice wrapper around RDD manipulations!
I hope this answers your question. Concerning performance, since you asked for an efficient implementation: I have not run any benchmarks yet, but I use it at work and it crunches 500k/1M-row datasets in a couple of minutes on the production cluster.
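To make the "distribute your computations yourself" point concrete, here is a rough numpy sketch of one batch SOM epoch written in the per-partition-accumulate-then-sum shape. It is not the Scala implementation described above: the function names, the Gaussian neighbourhood and the grid layout are assumptions for illustration, and the partitions are simulated as a list of arrays where Spark would use something like mapPartitions followed by a reduce.

```python
import numpy as np

def grid_coords(rows, cols):
    """Coordinates of each map unit on a rows x cols grid, shape (K, 2)."""
    return np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)

def neighbourhood(coords, sigma):
    """Gaussian neighbourhood weights between all pairs of units, shape (K, K)."""
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def partition_stats(X, prototypes, h):
    """Local accumulation for one partition: weighted sums and weight totals."""
    d2 = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)  # (n, K)
    bmu = d2.argmin(axis=1)        # best matching unit of each observation
    weights = h[:, bmu]            # (K, n): influence of each observation on each unit
    return weights @ X, weights.sum(axis=1)

def batch_epoch(partitions, prototypes, h):
    """One batch update: sum the per-partition statistics, then divide."""
    num = np.zeros_like(prototypes)
    den = np.zeros(len(prototypes))
    for X in partitions:           # stands in for mapPartitions + reduce
        n, d = partition_stats(X, prototypes, h)
        num += n
        den += d
    return num / den[:, None]

rng = np.random.default_rng(1)
data = rng.normal(size=(200, 5))
h = neighbourhood(grid_coords(3, 3), sigma=1.0)
start = rng.normal(size=(9, 5))
updated = batch_epoch(np.array_split(data, 4), start, h)
```

Each partition only returns a K x d matrix and a length-K vector, so nothing of the size of the whole net has to be stored per row.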
Applying spark's logistic regression to a specific dataset requires defining a number of iterations. So far I've learned that the value of the cost function at each iteration can be useful information to plot: it shows how many iterations the optimization needs to converge to a minimum. I was wondering if there is a way to output such information in spark? Looping over a train() function with different iteration numbers sounds like a solution that would take a lot of time on large datasets. It would be nice to know if there is a better one already built in. Thanks for any advice on this topic.
After you've trained a model (call it myModel) that has such a history, you can get the iteration-by-iteration history with
myModel.summary.objectiveHistory.foreach(...)
There's a nice example here in the Spark ML documentation -- once you know the right search terms.
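To illustrate what such a per-iteration history contains, here is a small numpy sketch that records the cost at every step of a logistic regression fit. It is deliberately not Spark code and uses plain gradient descent, which is an assumption made for brevity and not a description of Spark's own optimizer; fit_logistic and its parameters are made-up names.

```python
import numpy as np

def fit_logistic(X, y, iterations=50, learning_rate=0.5):
    """Gradient descent on the mean logistic loss, recording the loss each iteration."""
    w = np.zeros(X.shape[1])
    history = []
    for _ in range(iterations):
        z = X @ w
        # Mean logistic loss, log(1 + exp(z)) - y * z, computed in a stable way.
        history.append(float(np.mean(np.logaddexp(0.0, z) - y * z)))
        p = 1.0 / (1.0 + np.exp(-z))
        w -= learning_rate * (X.T @ (p - y)) / len(y)
    return w, history

rng = np.random.default_rng(7)
X = rng.normal(size=(500, 2))
y = (X @ np.array([1.5, -2.0]) + rng.normal(scale=0.5, size=500) > 0).astype(float)
w, history = fit_logistic(X, y)
```

Plotting history against the iteration index gives the convergence curve the question asks about, from a single training run.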
Does the spark mllib package shuffle the data? I have been using randomSplit on the data; however, looking at the splits, it looks like they keep the same order.
Is there a way to shuffle data before splitting it?
I think that you are confusing actual data shuffling with the random seed used when splitting. If you set your split seed to a constant, let's say 11L for example, you'll always get the same splits.
And as stated by @zero323, MLlib simply takes a random sample by traversing each partition.
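The "random sample by traversing each partition" idea can be pictured with the simplified sketch below. It is not Spark's actual implementation: random_split is a made-up function that assigns each row to a split from one seeded random draw, which is enough to show why the rows keep their original order within each split and why a fixed seed reproduces the same splits.

```python
import numpy as np

def random_split(rows, weights, seed):
    """Assign each row to a split from one seeded draw; nothing is reordered."""
    rng = np.random.default_rng(seed)
    # Cumulative normalized weights, e.g. [0.8, 0.2] -> [0.8, 1.0].
    bounds = np.cumsum(np.asarray(weights, dtype=float) / np.sum(weights))
    draws = rng.random(len(rows))
    which = np.searchsorted(bounds, draws, side="right")
    # Guard against floating-point rounding in the last cumulative bound.
    which = np.minimum(which, len(weights) - 1)
    return [[r for r, k in zip(rows, which) if k == i] for i in range(len(weights))]

rows = list(range(100))
train, test = random_split(rows, [0.8, 0.2], seed=11)
train_again, test_again = random_split(rows, [0.8, 0.2], seed=11)
train_other, test_other = random_split(rows, [0.8, 0.2], seed=12)
```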
Is there a way to shuffle data before splitting it?
It depends on the context. You can always repartition or sort by a random value, but that:
Is expensive
Requires some effort to avoid caching if you want to get a different result each time
Makes it harder to get a reproducible sample if you need one.
Thus my approach is to iterate over the split seed, which is the main principle of cross-validation. This way you can pick the best seed according to the evaluation step you are performing, and you still have a reproducible sample, but this approach is quite expensive.
I hope this helps.