Should I Avoid groupby() in Dataset/Dataframe? [duplicate] - apache-spark

This question already has an answer here:
DataFrame / Dataset groupBy behaviour/optimization
(1 answer)
Closed 5 years ago.
I know that with RDDs we were discouraged from using groupByKey() and encouraged to use alternatives such as reduceByKey() and aggregateByKey(), since those methods reduce on each partition first and thus cut down the amount of data being shuffled.
Now, my question is whether this still applies to Dataset/DataFrame. I was thinking that since the Catalyst engine does a lot of optimization, it will automatically know that it should reduce on each partition and then perform the groupBy. Am I correct, or do we still need to take steps to ensure that a reduction on each partition is performed before the groupBy?
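(For reference, a toy PySpark sketch of the RDD-level difference described above; the data is made up:)
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("a", 1), ("a", 2), ("b", 3)])
# reduceByKey combines values on each partition before the shuffle...
sums_reduced = pairs.reduceByKey(lambda x, y: x + y)
# ...while groupByKey ships every value across the network and aggregates afterwards.
sums_grouped = pairs.groupByKey().mapValues(sum)
print(sums_reduced.collect(), sums_grouped.collect())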

groupBy is fine to use with DataFrames and Datasets. Your thinking is completely right: the Catalyst optimizer builds the plan and optimizes the inputs to groupBy and any other aggregations you want to perform.
There is a good example (from Spark 1.4) at this link that shows a comparison of reduceByKey on an RDD versus groupBy on a DataFrame.
You can see that it is much faster than the RDD version, because groupBy lets Spark optimize the whole execution. For more details, see the official Databricks post introducing DataFrames.
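As a small illustration (not from the linked post), the physical plan of a DataFrame groupBy already contains a partial aggregation before the shuffle, so the map-side reduction happens automatically:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["key", "value"])

agg = df.groupBy("key").agg(F.sum("value").alias("total"))
agg.explain()  # look for HashAggregate (partial_sum) -> Exchange -> HashAggregate (final)
agg.show()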

Related

How to better structure linear algebra-heavy code in PySpark?

Need some suggestions on scaling a pipeline on Spark that performs collaborative filtering for about 200k-1m people, but does so in groups, with the largest group being approx. 40-50k customers at most. In addition to the collaborative filtering, which is reasonably fast with ALS, there's a lot of linear algebra that I couldn't really figure out how to perform with the Spark DataFrame API and had to drop down to the RDD API for, and that leads to a significant loss in performance. I currently have multiple variations of this script - in Scala, PySpark, and plain Python - and by far the fastest, despite not being distributed/parallelized, is plain Python, where I'm using numpy for all linear algebra tasks and Python for the remaining transformations.
So, to summarize, I've got a pipeline with a lot of complicated linear algebra that Spark doesn't seem to have performant native data structures for, and the workarounds I've devised - RDD-level manipulations for most operations, parallelizing and broadcasting the RDDs to perform matmul in chunks, etc. - are significantly slower than just performing the operations in-memory with numpy.
I've got a couple of ideas on how to scale this, but they are a bit hacky, so I was hoping that somebody more experienced could pitch in.
Keep the entire script in Python. Use Dask to distribute the processing of the various groups of customers in parallel across the cluster.
Keep the entire script in Python, but run it using PySpark, with a pandas UDF as the entry/exit point for the various Python functions. However, since a pandas UDF can only take a single dataframe as input and output, while my analysis requires multiple datasets, I need some workarounds. Here's what I've figured out (a rough code sketch follows at the end of this question):
Read all datasets into PySpark. All relevant datasets have the same number of rows, indexed by customer and other attributes, so I'll concat each row of a dataset into a single array column. Basically, the 3-4 datasets become 3-4 columns in a consolidated dataset, plus a customer index.
Transfer this across to python via a pandas UDF.
Extract all relevant datasets from this combined structure in Python, perform all the operations (around 1000 LOC), reassemble the outputs into a structure similar to the input, and transfer it back to PySpark.
Since I'm using a pandas UDF, computations across all groups should occur in parallel. This then becomes akin to running a Dask-like distributed compute via PySpark.
Extract all the data from this consolidated array, map types, and save via pyspark.
This is extremely hacky and has a few downsides, but I think it'll do the job. I realize that I won't really be able to debug the Python UDF code easily, so that'll be an irritant, and the solution is still fundamentally limited by the size of the largest single executor I can get, but despite that it'll likely perform better than native PySpark/Scala code.
Any suggestions on how to better structure this, or ideas about how to do more rapid linear algebra on pyspark natively would be greatly appreciated.
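(A rough sketch of what I mean in option 2, using a grouped pandas UDF; the column names, schema and toy math are placeholders, assuming Spark 3.0+ for applyInPandas:)
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Consolidated input: one row per customer, each dataset packed as an array column.
df = spark.createDataFrame(
    [("g1", "c1", [1.0, 2.0], [0.1, 0.2]),
     ("g1", "c2", [3.0, 4.0], [0.3, 0.4])],
    ["group_id", "customer_id", "dataset_a", "dataset_b"],
)

def run_linear_algebra(pdf: pd.DataFrame) -> pd.DataFrame:
    # Unpack the array columns back into numpy matrices for this group.
    a = np.vstack(pdf["dataset_a"].to_numpy())
    b = np.vstack(pdf["dataset_b"].to_numpy())
    score = (a * b).sum(axis=1)  # placeholder for the real ~1000 LOC of numpy math
    return pd.DataFrame({"customer_id": pdf["customer_id"], "score": score})

# One group is processed per task, so groups run in parallel across the cluster.
result = df.groupBy("group_id").applyInPandas(
    run_linear_algebra, schema="customer_id string, score double"
)
result.show()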

Is it inefficient to use a UDF to calculate the distance between two vectors?

I have implemented a classification algorithm in Spark that involves calculating distances between instances. The implementation uses dataframes (and raw SQL where possible). I transform the features of the instances into a vector so I can apply a Scaler and end up with a uniform schema regardless of how many features my dataset happens to have.
As far as I understand, Spark SQL can't do calculations with vector columns. So in order to calculate the distance between instances, I've had to define a python function and register it as a UDF. But I see warnings against using UDFs because the dataframe engine "can't optimise UDFs".
My questions are:
Is it correct that there is no way to calculate the distance between two feature vectors within SQL (not using a UDF)?
Can the use of a UDF to calculate the distance between vectors have a large impact on performance, or is there nothing for Spark to optimise here anyway?
Is there some other consideration I've missed?
To be clear, I'm hoping the answer is either
"You're doing it wrong, this is indeed inefficient, here's how to do it instead: ...", or
"UDFs are not intrinsically inefficient, this is a perfectly good use for them and there's no opimisation you're missing out on"
UDFs are neither efficient nor optimized, and their code is not executed inside the JVM. With PySpark in particular, each row has to be pickled and transferred between the JVM and a Python worker, which costs the OS a lot of resources. I implemented a geolocation job in PySpark using a UDF and it would never finish even after a few days; the same thing implemented in Scala finished in a few hours.
Do it in scala if you have to do it.
Maybe this can help:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/CosineSimilarity.scala
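As a side note beyond the answer above: on newer Spark versions the distance can often be expressed with built-in functions instead of a Python row UDF. A hedged sketch, assuming Spark 3.1+ for vector_to_array, zip_with and aggregate:
from pyspark.sql import SparkSession, functions as F
from pyspark.ml.functions import vector_to_array
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(Vectors.dense([1.0, 2.0]), Vectors.dense([3.0, 4.0]))], ["a", "b"]
)

# Squared differences per dimension, summed, then square-rooted: Euclidean distance
# computed entirely with Catalyst-optimized expressions, no per-row Python serialization.
dist = df.select(
    F.sqrt(
        F.aggregate(
            F.zip_with(vector_to_array("a"), vector_to_array("b"),
                       lambda x, y: (x - y) * (x - y)),
            F.lit(0.0),
            lambda acc, v: acc + v,
        )
    ).alias("euclidean_dist")
)
dist.show()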

Efficient implementation of SOM (Self organizing map) on Pyspark

I am struggling with the implementation of a performant version of a SOM Batch algorithm on Spark / Pyspark for a huge dataset with > 100 features.
I have the feeling that I can either use RDDs, where I can/have to specify the parallelization on my own, or use DataFrames, which should be more performant, but I see no way to use something like a local accumulation variable for each worker when using dataframes.
Ideas:
Using accumulators. Parallelize the calculations by creating a UDF which takes the observations as input, calculates the impacts on the net, and sends the impacts to an accumulator in the driver. (I implemented this version already, but it seems rather slow; I think the accumulator updates take too long.)
Store the results in a new column of the DataFrame and then sum them together at the end. (I would have to store a whole neural net in each row, e.g. 20*20*130, though.) Would Spark's optimizer realize that it does not need to save each net but only sum them together?
Create a custom parallelized algorithm using RDDs, similar to this: https://machinelearningnepal.com/2018/01/22/apache-spark-implementation-of-som-batch-algorithm/ (but with more performant calculation algorithms). However, I would have to use some kind of loop over each row to update the net, which sounds rather unperformant.
Any thoughts on the different options? Is there an even better option?
Or are all of these ideas not that good, and should I just preselect a maximum-variety subset of my dataset and train a SOM locally on that?
Thanks!
This is exactly what I did last year, so I might be in a good position to give you an answer.
First, here is my Spark implementation of the batch SOM algorithm (it is written in Scala, but most things will be similar in Pyspark).
I needed this algorithm for a project, and every implementation I found had at least one of these two problems or limitations:
they did not really implement the batch SOM algorithm, but used a map averaging method that gave me strange results (abnormal symmetries in the output map)
they did not use the DataFrame API (pure RDD API) and were not in the Spark ML/MLlib spirit, i.e. with a simple fit()/transform() API operating over DataFrames.
So I went on to code it myself: the batch SOM algorithm in Spark ML style. The first thing I did was to look at how k-means was implemented in Spark ML because, as you know, the batch SOM is very similar to the k-means algorithm. I could actually reuse a large portion of the Spark ML k-means code, but I had to modify the core algorithm and the hyperparameters.
I can quickly summarize how the model is built (a rough sketch in code follows the list below):
A SOMParams class, containing the SOM hyperparameters (size, training parameters, etc.)
A SOM class, which inherits from Spark's Estimator and contains the training algorithm. In particular, it contains a fit() method that operates on an input DataFrame, where the features are stored as a spark.ml.linalg.Vector in a single column. fit() will then select this column and unpack the DataFrame to obtain the underlying RDD[Vector] of features, and call the run() method on it. This is where all the computations happen, and as you guessed, it uses RDDs, accumulators and broadcast variables. Finally, the fit() method returns a SOMModel object.
SOMModel is a trained SOM model, and inherits from spark's Transformer/Model. It contains the map prototypes (center vectors), and contains a transform() method that can operate on DataFrames by taking an input feature column, and adding a new column with the predictions (projection on the map). This is done by a prediction UDF.
There is also SOMTrainingSummary that collects stuff such as the objective function.
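A rough PySpark-flavoured skeleton of that structure, for readers unfamiliar with the Spark ML internals (the real project is in Scala; the class names and placeholder logic below are illustrative only, not the author's code):
from pyspark.ml import Estimator, Model
from pyspark.ml.param.shared import HasFeaturesCol, HasPredictionCol
from pyspark.sql import functions as F


class SOM(Estimator, HasFeaturesCol, HasPredictionCol):
    """Estimator: fit() unpacks the feature column into an RDD and runs the training loop."""

    def _fit(self, dataset):
        rdd = dataset.select(self.getFeaturesCol()).rdd.map(lambda row: row[0])
        # ... run the batch-SOM iterations over `rdd` using broadcasts/accumulators ...
        prototypes = []  # placeholder for the trained map prototypes
        return SOMModel(prototypes)


class SOMModel(Model, HasFeaturesCol, HasPredictionCol):
    """Model: transform() adds a prediction column (projection on the map) via a UDF."""

    def __init__(self, prototypes):
        super().__init__()
        self.prototypes = prototypes

    def _transform(self, dataset):
        # Placeholder prediction: a real model would return the index of the closest prototype.
        predict = F.udf(lambda v: 0, "int")
        return dataset.withColumn(self.getPredictionCol(),
                                  predict(F.col(self.getFeaturesCol())))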
Here are the take-aways:
There is not really an opposition between RDDs and DataFrames (or rather Datasets, but the difference between those two is of no real importance here). They are just used in different contexts. In fact, a DataFrame can be seen as an RDD specialized for manipulating structured data organized in columns (such as relational tables), allowing SQL-like operations and optimization of the execution plan (Catalyst optimizer).
For structured data and select/filter/aggregation operations, DO USE DataFrames, always.
...but for more complex tasks such as a machine learning algorithm, you NEED to come back to the RDD API and distribute your computations yourself, using map/mapPartitions/foreach/reduce/reduceByKey and so on. Look at how things are done in MLlib: it's only a nice wrapper around RDD manipulations!
Hope this answers your question. Concerning performance: as you asked for an efficient implementation, I have not made any benchmarks yet, but I use it at work and it crunches 500k/1M-row datasets in a couple of minutes on the production cluster.
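To make the last take-away concrete, here is a hedged sketch (not the author's code) of one batch-SOM update step done RDD-style: broadcast the current prototypes, accumulate per-partition partial sums with mapPartitions, and merge them with reduce:
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

n_nodes, n_features = 400, 130                      # e.g. a 20x20 map with 130 features
prototypes = sc.broadcast(np.random.rand(n_nodes, n_features))
data = sc.parallelize(np.random.rand(10000, n_features).tolist())

def partial_sums(rows):
    num = np.zeros((n_nodes, n_features))
    den = np.zeros(n_nodes)
    for x in rows:
        x = np.asarray(x)
        bmu = np.argmin(((prototypes.value - x) ** 2).sum(axis=1))  # best matching unit
        num[bmu] += x        # a real SOM would weight by the neighbourhood kernel here
        den[bmu] += 1.0
    yield num, den

num, den = data.mapPartitions(partial_sums).reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]))
new_prototypes = num / np.maximum(den, 1.0)[:, None]  # one updated prototype per map node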

Percentiles in spark - most efficient method (RDD vs SqlContext) [duplicate]

This question already has answers here:
How to find median and quantiles using Spark
(8 answers)
Closed 4 years ago.
I have a large grouped dataset in spark that I need to return the percentiles from 0.01 to 0.99 for.
I have been using online resources to determine different methods of doing this, from operations on RDD:
How to compute percentiles in Apache Spark
To SQLContext functionality:
Calculate quantile on grouped data in spark Dataframe
My question is does anyone have any opinion on what the most efficient approach is?
Also, as a bonus, SQLContext has functions for both percentile_approx and percentile. There isn't much documentation available online for 'percentile': is it just a non-approximated version of 'percentile_approx'?
Dataframes will be more efficient in general. Read this for details on the reasons - https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html.
There are a few benchmarks out there as well. For example this one claims that "the new DataFrame API is faster than the RDD API for simple grouping and aggregations".
You can look up Hive's documentation to figure out the difference between percentile and percentile_approx.
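For what it's worth, a hedged sketch of the DataFrame route with grouped approximate percentiles (the column names and data are made up; percentile_approx is available through SQL expressions, and as a native function in newer Spark versions):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a", float(v)) for v in range(100)] + [("b", float(v)) for v in range(50)],
    ["group", "value"],
)

percentiles = [i / 100.0 for i in range(1, 100)]     # 0.01 .. 0.99
expr = "percentile_approx(value, array({}))".format(",".join(str(p) for p in percentiles))
result = df.groupBy("group").agg(F.expr(expr).alias("pctiles"))
result.show(truncate=False)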

Using RDD transformations and converting to a Dataset before an action VS using a Dataset and its API

Consider the two scenarios:
A) I have an RDD, various RDD transformations are called on it, and before any action is run I create a Dataset from it.
B) I create a Dataset at the very beginning and call various Dataset methods on it.
Question: If the two scenarios produce the same outcome logically - one uses RDD transformations and converts to a Dataset right before an action, the other just uses a Dataset and its transformations - do both scenarios go through the same optimizations?
No they do not.
When you work with RDDs and apply RDD transformations to them, no optimization is done. Only when you convert to a Dataset at the end does the conversion to the Tungsten-based representation (which takes less memory and doesn't need to go through garbage collection) take place.
When you use a Dataset from the beginning, it uses the Tungsten-based memory representation from the start. This means it takes less memory, shuffles are smaller and faster, and no GC overhead occurs (although conversion from the internal representation to a case class and back happens any time typed operations are used). If you use DataFrame operations on the Dataset, it may also take advantage of code generation and Catalyst optimizations.
See also my answer in: Do I have to explicitly use Dataframe's methods to take advantage of Dataset's optimization?
They don't. The RDD API doesn't use any of the Tungsten / Catalyst optimizations, so the fact that the logic is equivalent is not relevant.
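A quick way to see the difference is to compare the physical plans (a sketch with made-up data; in scenario A the RDD part shows up only as an opaque existing-RDD scan that Catalyst cannot look into):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Scenario A: transform with the RDD API, convert to a DataFrame/Dataset at the end.
rdd = (spark.sparkContext.parallelize(range(1000))
       .map(lambda x: (x % 10, x))
       .filter(lambda kv: kv[1] > 5))
df_a = rdd.toDF(["key", "value"])
df_a.groupBy("key").sum("value").explain()   # plan starts from "Scan ExistingRDD"

# Scenario B: stay in the DataFrame/Dataset API from the start.
df_b = spark.range(1000).select((F.col("id") % 10).alias("key"), F.col("id").alias("value"))
df_b.where("value > 5").groupBy("key").sum("value").explain()   # fully visible to Catalyst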

Resources