Decomposing Spark RDDs - apache-spark

In Spark, it is possible to compose multiple RDDs into one using zip, union, join, etc.
Is it possible to decompose an RDD efficiently, namely without performing multiple passes over the original RDD? What I am looking for is something similar to:
val rdd: RDD[T] = ...
val grouped: Map[K, RDD[T]] = rdd.specialGroupBy(...)
One of the strengths of RDDs is that they enable performing iterative computations efficiently. In some (machine learning) use cases I encountered, we need to perform iterative algorithms on each of the groups separately.
The current possibilities I am aware of are:
GroupBy: groupBy returns an RDD[(K, Iterable[T])], which does not give you the RDD benefits on the group itself (the Iterable).
Aggregations: reduceByKey, foldByKey, etc. perform only one "iteration" over the data and do not have the expressive power needed to implement iterative algorithms.
Filter: creating a separate RDD per key using the filter method requires multiple passes over the data (the number of passes equals the number of keys), which is not feasible unless the number of keys is very small.
Some of the use cases I am considering are, given a very large (tabular) dataset:
We wish to execute some iterative algorithm on each of the columns separately, for example some automated feature extraction. A natural way to do so would be to decompose the dataset such that each column is represented by a separate RDD.
We wish to decompose the dataset into disjoint datasets (for example a dataset per day) and execute some machine learning modeling on each of them.

I think the best option is to write out the data in a single pass to one file per key (see "Write to multiple outputs by key Spark - one Spark job"), then load the per-key files into one RDD each.
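The idea of writing one output per key in a single pass can be illustrated locally, without Spark. The sketch below is plain Python and only mirrors the shape of the approach: `write_by_key`, `load_key`, the `.txt` naming and the sample rows are all hypothetical names chosen for illustration, and a real job would rely on Spark's own output machinery rather than open file handles.

```python
import os
import tempfile


def write_by_key(records, key_fn, out_dir):
    """Write each record to a file named after its key, reading `records` once."""
    handles = {}
    try:
        for record in records:  # single pass over the data
            key = key_fn(record)
            if key not in handles:
                handles[key] = open(os.path.join(out_dir, f"{key}.txt"), "w")
            handles[key].write(record + "\n")
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(handles)


def load_key(out_dir, key):
    """Load the records of one key; each key can now be processed on its own."""
    with open(os.path.join(out_dir, f"{key}.txt")) as f:
        return [line.rstrip("\n") for line in f]


# Hypothetical tabular rows keyed by day (first field).
rows = ["2024-01-01,a", "2024-01-02,b", "2024-01-01,c"]
with tempfile.TemporaryDirectory() as tmp:
    keys = write_by_key(rows, lambda r: r.split(",")[0], tmp)
    per_key = {k: load_key(tmp, k) for k in keys}
```

Note that this keeps one open handle per key, so it shares the limitation raised in the question: it is only comfortable when the number of keys is moderate.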

Related

How to better structure linear algebra-heavy code in PySpark?

Need some suggestions on scaling a pipeline on Spark that performs collaborative filtering for about 200k-1m people, but does so in groups, with the largest group being approx. 40-50k customers. In addition to collaborative filtering, which is reasonably fast with ALS, there's a lot of linear algebra that I couldn't figure out how to perform with the Spark DataFrame API, so I had to drop down to the RDD API, which leads to a significant loss in performance. I've currently got multiple variations of this script - in Scala, PySpark, and Python - and by far the fastest, despite not being distributed/parallelized, is Python, where I'm using numpy for all linear algebra tasks and plain Python for the remaining transformations.
So, to summarize, I've got a pipeline with a lot of complicated linear algebra that Spark doesn't seem to have performant native data structures for, and the workarounds I've devised - RDD-level manipulations for most operations, parallelizing and broadcasting the RDDs to perform matmul in chunks, etc. - are significantly slower than just performing the operations in memory with numpy.
I've got a couple of ideas on how to scale this, but they are a bit hacky, so I was hoping that somebody more experienced could pitch in.
Keep the entire script in Python. Use Dask to distribute the processing of the various groups of customers in parallel across the cluster.
Keep the entire script in Python, but run it using PySpark, with a pandas UDF as the entry/exit point for the various Python functions. However, since pandas UDFs are limited in that I can only input and output a single dataframe, while my analysis requires multiple datasets, I need some workarounds. Here's what I've figured out:
Read all datasets into pyspark. All relevant datasets have same number of rows, indexed with customer and other attributes, so I'll concat each row of a dataset into a single column array. So, basically, the 3-4 datasets become 3-4 columns in a consolidated dataset + a customer index.
Transfer this across to python via a pandas UDF.
Extract all relevant datasets from this combined structure in Python, perform all the operations (around 1000 loc), reassemble the outputs into a structure similar to the input, and transfer it back to PySpark.
Since I used a pandas UDF, computations across all groups should occur in parallel. This then becomes akin to running a Dask-like distributed compute via PySpark.
Extract all the data from this consolidated array, map types, and save via pyspark.
This is extremely hacky and has a few downsides, but I think it'll do the job. I realize that I won't really be able to debug the Python UDF code easily, so that'll be an irritant, and the solution is still fundamentally limited by the size of the largest single executor I can get, but despite that it'll likely perform better than native PySpark/Scala code.
Any suggestions on how to better structure this, or ideas about how to do more rapid linear algebra on pyspark natively would be greatly appreciated.
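The pack / process-per-group / unpack workaround described above can be sketched in plain Python to make the data flow concrete. This is not PySpark code: `pack`, `process_group`, `run_by_group`, the field names and the weighted-sum computation are all hypothetical stand-ins, and the thread pool only imitates groups being handled independently (it does not give real CPU parallelism the way separate executors would).

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def pack(customers, *datasets):
    """Combine row-aligned datasets into one consolidated row per customer."""
    return [
        {"group": group, "customer": cid, "parts": [d[i] for d in datasets]}
        for i, (group, cid) in enumerate(customers)
    ]


def process_group(rows):
    """Stand-in for the per-group analysis: unpack, compute, reassemble."""
    ratings = [r["parts"][0] for r in rows]  # first packed dataset
    weights = [r["parts"][1] for r in rows]  # second packed dataset
    # Hypothetical computation: weighted sum of each customer's rating vector.
    return [
        {"customer": r["customer"], "score": sum(x * w for x, w in zip(rt, wt))}
        for r, rt, wt in zip(rows, ratings, weights)
    ]


def run_by_group(packed):
    """Group consolidated rows, process every group independently, flatten."""
    groups = defaultdict(list)
    for row in packed:
        groups[row["group"]].append(row)
    with ThreadPoolExecutor() as pool:
        results = pool.map(process_group, groups.values())
    return [row for group_rows in results for row in group_rows]


customers = [("g1", "c1"), ("g1", "c2"), ("g2", "c3")]
ratings = [[1.0, 2.0], [0.0, 1.0], [3.0, 3.0]]
weights = [[0.5, 0.5], [1.0, 2.0], [1.0, 0.0]]
scores = run_by_group(pack(customers, ratings, weights))
```

Keeping the per-group function free of any cluster-specific code, as `process_group` is here, also means it can be run and debugged locally on a single group before it is wrapped in a UDF.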

Efficient implementation of SOM (Self organizing map) on Pyspark

I am struggling with the implementation of a performant version of a SOM Batch algorithm on Spark / Pyspark for a huge dataset with > 100 features.
I have the feeling that I can either use RDDs, where I can (and have to) specify the parallelization on my own, or use DataFrames, which should be more performant, but I see no way to use something like a local accumulation variable for each worker when using DataFrames.
Ideas:
Using accumulators. Parallelize the calculations by creating a UDF which takes the observations as input, calculates the impacts on the net and sends the impacts to an accumulator in the driver. (I have implemented this version already, but it seems rather slow; I think the accumulator updates take too long.)
Store the results in a new column of the DataFrame and then sum them together at the end. (I would have to store a whole neural net in each row, e.g. 20*20*130, though.) Do Spark's optimizations realize that it does not need to save each net but only sum them together?
Create a custom parallelized algorithm using RDDs, similar to https://machinelearningnepal.com/2018/01/22/apache-spark-implementation-of-som-batch-algorithm/ (but with more performant calculations). But I would have to use some kind of loop over each row to update the net, which sounds rather unperformant.
Any thoughts on the different options? Is there an even better option?
Or are none of these ideas that good, and should I just preselect a maximum-variety subset of my dataset and train a SOM locally on that?
Thanks!
This is exactly what I did last year, so I might be in a good position to give you an answer.
First, here is my Spark implementation of the batch SOM algorithm (it is written in Scala, but most things will be similar in Pyspark).
I needed this algorithm for a project, and every implementation I found had at least one of these two problems or limitations:
they did not really implement the batch SOM algorithm, but used a map averaging method that gave me strange results (abnormal symmetries in the output map)
they did not use the DataFrame API (pure RDD API) and were not in the Spark ML/MLlib spirit, i.e. with a simple fit()/transform() API operating over DataFrames.
So I went on to code it myself: the batch SOM algorithm in Spark ML style. The first thing I did was look at how k-means is implemented in Spark ML, because, as you know, the batch SOM is very similar to the k-means algorithm. I could actually re-use a large portion of the Spark ML k-means code, but I had to modify the core algorithm and the hyperparameters.
I can summarize quickly how the model is built:
A SOMParams class, containing the SOM hyperparameters (size, training parameters, etc.)
A SOM class, which inherits from Spark's Estimator and contains the training algorithm. In particular, it contains a fit() method that operates on an input DataFrame, where features are stored as a spark.ml.linalg.Vector in a single column. fit() then selects this column, unpacks the DataFrame to obtain the underlying RDD[Vector] of features, and calls the run() method on it. This is where all the computations happen, and as you guessed, it uses RDDs, accumulators and broadcast variables. Finally, the fit() method returns a SOMModel object.
SOMModel is a trained SOM model, and inherits from spark's Transformer/Model. It contains the map prototypes (center vectors), and contains a transform() method that can operate on DataFrames by taking an input feature column, and adding a new column with the predictions (projection on the map). This is done by a prediction UDF.
There is also SOMTrainingSummary that collects stuff such as the objective function.
Here are the take-aways:
There is not really an opposition between RDD and DataFrames (or rather Datasets, but the difference between those two is of no real importance here). They are just used in different contexts. In fact, a DataFrame can be seen as a RDD specialized for manipulating structured data organized in columns (such as relational tables), allowing SQL-like operations and an optimization of the execution plan (Catalyst optimizer).
For structured data, select/filter/aggregation operations, DO USE Dataframes, always.
...but for more complex tasks such as a machine learning algorithm, you NEED to come back to the RDD API and distribute your computations yourself, using map/mapPartitions/foreach/reduce/reduceByKey and so on. Look at how things are done in MLlib: it's only a nice wrapper around RDD manipulations!
Hope this answers your question. Concerning performance, since you asked for an efficient implementation: I have not run any benchmarks yet, but I use it at work and it crunches 500k/1M-row datasets in a couple of minutes on the production cluster.
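To make the "distribute your computations yourself" point concrete, here is a minimal plain-Python sketch of one batch SOM iteration expressed as per-partition partial sums that are merged afterwards, which is the shape a mapPartitions-plus-reduce formulation would take. It is not the implementation described in this answer: the function names are hypothetical, the map is simplified to a 1-D grid with a Gaussian neighbourhood, and the partitions are ordinary lists.

```python
import math


def bmu(x, prototypes):
    """Index of the prototype closest to x (squared Euclidean distance)."""
    return min(range(len(prototypes)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(x, prototypes[j])))


def partial_sums(partition, prototypes, sigma):
    """Per-partition numerators and denominators of the batch update."""
    dim = len(prototypes[0])
    num = [[0.0] * dim for _ in prototypes]
    den = [0.0] * len(prototypes)
    for x in partition:
        c = bmu(x, prototypes)
        for j in range(len(prototypes)):
            # Gaussian neighbourhood on a 1-D grid (assumed simplification).
            h = math.exp(-((j - c) ** 2) / (2 * sigma ** 2))
            den[j] += h
            for d in range(dim):
                num[j][d] += h * x[d]
    return num, den


def merge(a, b):
    """Add two (num, den) pairs; associative, so usable as a reduce step."""
    num = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a[0], b[0])]
    den = [p + q for p, q in zip(a[1], b[1])]
    return num, den


def batch_step(partitions, prototypes, sigma):
    """One batch SOM iteration over a non-empty list of partitions."""
    total = None
    for part in partitions:
        sums = partial_sums(part, prototypes, sigma)
        total = sums if total is None else merge(total, sums)
    num, den = total
    return [[v / den[j] for v in num[j]] if den[j] > 0 else prototypes[j]
            for j in range(len(prototypes))]


data = [[0.0], [1.0], [9.0], [10.0]]
protos = [[0.0], [10.0]]
one = batch_step([data], protos, sigma=0.1)
two = batch_step([data[:2], data[2:]], protos, sigma=0.1)
```

Because only the small `(num, den)` pair leaves each partition, the amount of data sent back per iteration depends on the map size rather than on the number of rows, and the result does not depend on how the rows are partitioned.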

How to make randomSplit generate ordered items across all splits

I am developing a prediction model on time-series data. I am using TrainValidationSplit in Spark to train and validate my model before testing it on unseen data.
In the validation phase, I need the data to be ordered by the timestamps in my input RDD (an RDD[(timestamp, text)]). I know TrainValidationSplit uses the randomSplit method, which shuffles my training data. I need a way to ensure that, when I pass my training data to TrainValidationSplit and it divides the data into training and validation sets, the two sets are ordered by timestamp.
Is there any way to make randomSplit generate elements that are ordered across the split RDDs in the following way? For example, if my RDD is (1,3,4,5,7,8), rdd.randomSplit(Array(0.5, 0.5)) should generate the first RDD as (1,3,4) and the second one as (7,5,8). The order within each split is not important, but overall the IDs (timestamps) in the first split should be less than those in the second split.
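The requested behaviour is a deterministic, contiguous cut of the time-ordered data rather than a random split. A minimal plain-Python sketch of that logic follows; `ordered_split` and its `key` parameter are hypothetical names, and it does not change how TrainValidationSplit itself divides data. On an RDD, the same idea would presumably be built from a sort plus a rank-based filter (for instance sortBy followed by zipWithIndex), which is an assumption and not something stated in the question.

```python
def ordered_split(records, weights, key=lambda r: r[0]):
    """Sort by key, then cut into contiguous chunks proportional to weights."""
    ordered = sorted(records, key=key)
    total = float(sum(weights))
    splits, start, cumulative = [], 0, 0.0
    for i, w in enumerate(weights):
        cumulative += w
        if i == len(weights) - 1:
            end = len(ordered)  # last split takes whatever remains
        else:
            end = round(len(ordered) * cumulative / total)
        splits.append(ordered[start:end])
        start = end
    return splits


# (timestamp, text) pairs in arbitrary order.
events = [(5, "e"), (1, "a"), (8, "h"), (3, "c"), (7, "g"), (4, "d")]
train, valid = ordered_split(events, [0.5, 0.5])
```

Every timestamp in `train` is then no later than any timestamp in `valid`, which is the property needed so that validation only uses data from after the training period.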

Using RDD transformations and converting to a Dataset before an action vs. using a Dataset and its API

Consider the two scenarios:
A) I have an RDD and call various RDD transformations on it, and before any action is run I create a Dataset from it.
B) I create a Dataset at the very beginning and call various Dataset methods on it.
Question: If the two scenarios produce the same outcome logically - one uses RDD transformations and converts to a Dataset right before an action, the other just uses a Dataset and its transformations - do both scenarios go through the same optimizations?
No, they do not.
When you use RDDs and RDD transformations, no optimization is done. When you convert to a Dataset at the end, then and only then is the data converted to the Tungsten-based representation (which takes less memory and doesn't need to go through garbage collection).
When you use a Dataset from the beginning, it uses the Tungsten-based memory representation from the start. This means it takes less memory, shuffles are smaller and faster, and there is no GC overhead (although conversion from the internal representation to the case class and back occurs any time typed operations are used). If you use DataFrame operations on the Dataset, it may also take advantage of code generation and Catalyst optimizations.
See also my answer in: Do I have to explicitly use Dataframe's methods to take advantage of Dataset's optimization?
They don't. The RDD API doesn't use any of the Tungsten / Catalyst optimizations, and whether the logic is equivalent is not relevant.

Spark mllib shuffling the data

Does the Spark MLlib package shuffle the data? I have been using randomSplit on the data; however, looking at the splits, it looks like they have the same order.
Is there a way to shuffle data before splitting it?
I think that you are confusing actual data shuffling with the random seed used when splitting. If you set your split seed to a constant, let's say 11L for example, you'll always get the same splits.
And as stated by @zero323, MLlib simply takes a random sample by traversing each partition.
Is there a way to shuffle data before splitting it?
It depends on the context. You can always repartition or sort by a random value, but this:
Is expensive.
Requires some effort to avoid caching if you want to get a different result each time.
Makes it harder to get a reproducible sample if you need one.
Thus my approach is to iterate over the split seed, which is the main principle of cross-validation. This way you can pick the best seed according to the evaluation step you are performing, and you still have a reproducible sample, but this approach is quite expensive.
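The difference between a fixed seed and a shuffle can be seen in a small plain-Python sketch of a per-element sampling split. This is only loosely modelled on the behaviour described above and is not Spark's implementation; `random_split` is a hypothetical helper that draws one uniform number per item and keeps items in their original relative order.

```python
import random


def random_split(items, weights, seed):
    """Assign each item to a split by one uniform draw; relative order is kept."""
    rng = random.Random(seed)
    total = float(sum(weights))
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    bounds[-1] = 1.0  # guard against floating-point rounding
    splits = [[] for _ in weights]
    for item in items:
        x = rng.random()
        for i, upper in enumerate(bounds):
            if x < upper:
                splits[i].append(item)
                break
    return splits


data = list(range(20))
first = random_split(data, [0.8, 0.2], seed=11)
again = random_split(data, [0.8, 0.2], seed=11)  # same seed, same splits

# Iterating over seeds gives different but individually reproducible splits.
by_seed = {seed: random_split(data, [0.8, 0.2], seed) for seed in range(3)}
```

Each split is a sample of the input in its original order, which matches the observation in the question that the splits "have the same order": the randomness decides membership, not position.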
I hope this helps.
