Relation between RDD and Dataset/Dataframe from a technical point of view - apache-spark

I am trying to understand whether there is a relationship between RDDs and DataFrames/Datasets from a technical point of view. RDDs are often described as the fundamental data abstraction in Spark. To my understanding, this would mean that DataFrames/Datasets should also be built on them. In the original Spark SQL paper, figures 1 and 3 point to this connection. However, I haven't found any documentation on what this connection looks like (if it exists at all).
So my question: are DataFrames/Datasets based on RDDs, or are these two concepts independent?

DataFrames and Datasets are based on RDDs, although this is a little bit hidden. DataFrames and Datasets live mostly in the spark-sql project, whereas RDDs are in spark-core.
Here is the technical point of view on how a DataFrame, which is a Dataset[Row], and RDDs are linked: a DataFrame has a QueryExecution, which controls how the whole SQL execution behaves. When this gets executed by the engine, the output is an internal RDD of rows, lazy val toRdd: RDD[InternalRow] = executedPlan.execute(). Given that RDD and a schema, a DataFrame is formed.
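You can observe this link from the public API. A minimal sketch (the toy data is made up, but queryExecution.toRdd and .rdd are the actual accessors):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("rdd-link").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

// The physical plan's output, i.e. what executedPlan.execute() produced:
val internal = df.queryExecution.toRdd // RDD[InternalRow]

// The public accessor that deserializes the internal rows back into Rows:
val external = df.rdd // RDD[Row]

println(external.count())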

Related

How to implement Simrank efficiently using Spark RDD?

I want to implement SimRank using the Spark RDD interface, but my dataset is too large to process: the bipartite graph has hundreds of millions of nodes, so finding the similarity score of all neighborhood pairs is computationally expensive. I have tried to find some existing implementations, but they all seem not to be scalable. Any suggestions?
I suggest first taking a look at the GraphX and GraphFrames libraries that come with the Apache Spark ecosystem and seeing whether they fit your needs. They bring graph processing support on top of RDDs and DataFrames, respectively.
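As a rough, hypothetical illustration of building a graph with GraphX (the toy vertex/edge data is made up; SimRank itself would still have to be implemented on top of operators like aggregateMessages or pregel):

import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("graphx-sketch").getOrCreate()
val sc = spark.sparkContext

// A tiny bipartite graph: users 1-2 on one side, items 10-11 on the other.
val vertices = sc.parallelize(Seq((1L, "user"), (2L, "user"), (10L, "item"), (11L, "item")))
val edges = sc.parallelize(Seq(Edge(1L, 10L, 1.0), Edge(1L, 11L, 1.0), Edge(2L, 10L, 1.0)))

val graph = Graph(vertices, edges)
// GraphX exposes RDD-backed views and graph-parallel operators:
println(graph.degrees.collect().mkString(", "))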

Efficient implementation of SOM (Self organizing map) on Pyspark

I am struggling to implement a performant version of the batch SOM algorithm on Spark / PySpark for a huge dataset with > 100 features.
I have the feeling that I can either use RDDs, where I can (and have to) specify the parallelization myself, or use DataFrames, which should be more performant, but where I see no way to use something like a local accumulation variable on each worker.
Ideas:
Using accumulators: parallelize the calculations by creating a UDF which takes the observations as input, calculates the impacts on the net, and sends the impacts to an accumulator in the driver. (I implemented this version already, but it seems rather slow; I think the accumulator updates take too long.)
Store the results in a new column of the DataFrame and then sum them together at the end. (I would have to store a whole neural net in each row, e.g. 20*20*130, though.) Does Spark's optimizer realize that it does not need to store each net, but only to sum them together?
Create a custom parallelized algorithm using RDDs, similar to this one: https://machinelearningnepal.com/2018/01/22/apache-spark-implementation-of-som-batch-algorithm/ (but with more performant calculation algorithms). However, I would have to use some kind of loop over each row and update the net, which sounds rather unperformant.
Any thoughts on the different options? Is there an even better option?
Or are all these ideas not that good, and should I just preselect a maximum-variety subset of my dataset and train a SOM locally on that?
Thanks!
This is exactly what I did last year, so I might be in a good position to give you an answer.
First, here is my Spark implementation of the batch SOM algorithm (it is written in Scala, but most things will be similar in PySpark).
I needed this algorithm for a project, and every implementation I found had at least one of these two problems or limitations:
they did not really implement the batch SOM algorithm, but used a map averaging method that gave me strange results (abnormal symmetries in the output map)
they did not use the DataFrame API (pure RDD API) and were not in the Spark ML/MLlib spirit, i.e. with a simple fit()/transform() API operating over DataFrames.
So, there I went on to code it myself: the batch SOM algorithm in Spark ML style. The first thing I did was look at how k-means was implemented in Spark ML, because, as you know, the batch SOM is very similar to the k-means algorithm. Actually, I could reuse a large portion of the Spark ML k-means code, but I had to modify the core algorithm and the hyperparameters.
I can quickly summarize how the model is built (a minimal skeleton sketch follows the list):
A SOMParams class, containing the SOM hyperparameters (size, training parameters, etc.)
A SOM class, which inherits from Spark's Estimator and contains the training algorithm. In particular, it contains a fit() method that operates on an input DataFrame where the features are stored as a spark.ml.linalg.Vector in a single column. fit() then selects this column and unpacks the DataFrame to obtain the underlying RDD[Vector] of features, and calls the run() method on it. This is where all the computations happen, and as you guessed, it uses RDDs, accumulators and broadcast variables. Finally, the fit() method returns a SOMModel object.
SOMModel is a trained SOM model and inherits from Spark's Transformer/Model. It contains the map prototypes (center vectors) and a transform() method that can operate on DataFrames by taking an input feature column and adding a new column with the predictions (the projection on the map). This is done by a prediction UDF.
There is also a SOMTrainingSummary that collects things such as the objective function.
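For illustration, here is a heavily simplified, hypothetical skeleton of such an Estimator/Model pair. Only the spark.ml base classes (Estimator, Model, Identifiable, ParamMap) are real API; the SOM/SOMModel bodies are assumptions sketched from the description above, not the author's actual code:

import org.apache.spark.ml.{Estimator, Model}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.types.StructType

class SOM(override val uid: String) extends Estimator[SOMModel] {
  def this() = this(Identifiable.randomUID("som"))

  override def fit(dataset: Dataset[_]): SOMModel = {
    // Unpack the feature column into the underlying RDD[Vector] ...
    val features: RDD[Vector] = dataset.select("features").rdd.map(_.getAs[Vector](0))
    run(features) // ... and hand it to the RDD-level training loop.
  }

  private def run(data: RDD[Vector]): SOMModel = {
    // The batch SOM iterations (accumulators, broadcast variables) would live here.
    new SOMModel(uid)
  }

  override def copy(extra: ParamMap): SOM = defaultCopy(extra)
  override def transformSchema(schema: StructType): StructType = schema
}

class SOMModel(override val uid: String) extends Model[SOMModel] {
  override def transform(dataset: Dataset[_]): DataFrame = {
    // A prediction UDF would add the projection column here.
    dataset.toDF()
  }
  override def copy(extra: ParamMap): SOMModel = defaultCopy(extra)
  override def transformSchema(schema: StructType): StructType = schema
}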
Here are the take-aways:
There is not really an opposition between RDDs and DataFrames (or rather Datasets, but the difference between those two is of no real importance here). They are just used in different contexts. In fact, a DataFrame can be seen as an RDD specialized for manipulating structured data organized in columns (such as relational tables), allowing SQL-like operations and optimization of the execution plan (the Catalyst optimizer).
For structured data and select/filter/aggregation operations, DO USE DataFrames, always.
...but for more complex tasks, such as a machine learning algorithm, you NEED to come back to the RDD API and distribute the computations yourself, using map/mapPartitions/foreach/reduce/reduceByKey and so on. Look at how things are done in MLlib: it's only a nice wrapper around RDD manipulations! (A small contrast of the two styles is sketched below.)
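A rough sketch of that rule of thumb (the toy data and column names are made up):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("df-vs-rdd").getOrCreate()
import spark.implicits._

val df = Seq(("Paris", 34), ("Lyon", 28), ("Paris", 41)).toDF("city", "age")

// Structured work: stay in the DataFrame API so Catalyst can optimize it.
df.filter($"age" > 30).groupBy($"city").count().show()

// Custom distributed work: drop down to the RDD and control the computation yourself.
val total = df.select("age").rdd
  .mapPartitions(rows => Iterator(rows.map(_.getInt(0)).sum)) // local per-partition accumulation
  .reduce(_ + _)                                              // merge the partial results
println(total)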
Hope this answers your question. Concerning performance: you asked for an efficient implementation, and I have not run any benchmarks yet, but I use it at work and it crunches 500k/1M-row datasets in a couple of minutes on the production cluster.

How to name a DataFrame in Spark to make the DAG diagram easier to read?

In Spark, the DAG diagram can get quite complicated after a few joins.
Is there a way to make it more understandable by, first, naming Spark datasets, and second, tagging each stage with the dataset that it computes (or helps to compute), so that we could trace the stages back to the code?
You can name the RDD like this:
yourrdd.setName("ABCD")
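There is no direct DataFrame equivalent of setName, but a couple of real APIs help label what you see in the Spark UI. A minimal sketch (the names and toy data are made up):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("naming-sketch").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

// setName applies to RDDs, so it helps once you operate on the extracted RDD:
val rdd = df.rdd.setName("customers_rdd")
rdd.count()

// For DataFrame jobs, a job description labels everything the following actions trigger:
spark.sparkContext.setJobDescription("count customers")
df.count()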

What data structure is a DataFrame in Spark?

This is a follow-up to my previous question.
A Row is an ordered set of key-value pairs. A DataFrame is a collection of Rows.
What data structure is a DataFrame, actually? Is it a list, a set, or some other "collection"? Is it a relation as in SQL?
It's an abstraction over an RDD[Row], or a Dataset[Row] in Spark 2, with a defined schema set through a series of Column classes.
Is it a list, a set, or some other "collection"?
Not in the Java sense of those words. It is similar to how an RDD is none of those, but rather a "lazy collection".
Is it a relation as in SQL?
You're welcome to run Spark SQL over a DataFrame, but it's a table. Relations are optional.
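As a quick illustration of the Dataset[Row]-plus-schema view (the toy data is made up; the type alias is real in Spark 2+):

import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

val spark = SparkSession.builder().master("local[*]").appName("df-structure").getOrCreate()
import spark.implicits._

val df: DataFrame = Seq((1, "a"), (2, "b")).toDF("id", "value")

// In Spark 2+, DataFrame is literally a type alias for Dataset[Row]:
val ds: Dataset[Row] = df

// The schema is what distinguishes it from a plain RDD[Row]:
df.printSchema()
println(df.rdd.getClass) // the underlying RDD[Row]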
Although a DataFrame is an abstraction over an RDD, the internal representation of a DataFrame is quite different from that of an RDD.
An RDD is represented as Java objects and uses the JVM for all operations, whereas a DataFrame is represented in Tungsten's compact binary format.
Here is an excellent article which elaborates on how DataFrames are represented in Tungsten.

Decomposing Spark RDDs

In Spark, it is possible to compose multiple RDDs into one, using zip, union, join, etc.
Is it possible to decompose an RDD efficiently, namely without performing multiple passes over the original RDD? What I am looking for is something similar to:
val rdd: RDD[T] = ...
val grouped: Map[K, RDD[T]] = rdd.specialGroupBy(...)
One of the strengths of RDDs is that they enable performing iterative computations efficiently. In some (machine learning) use cases I encountered, we need to perform iterative algorithms on each of the groups separately.
The current possibilities I am aware of are:
GroupBy: groupBy returns an RDD[(K, Iterable[T])], which does not give you the RDD benefits on the group itself (the Iterable).
Aggregations: such as reduceByKey, foldByKey, etc. These perform only one "iteration" over the data and do not have the expressive power to implement iterative algorithms.
Creating separate RDDs using the filter method and multiple passes over the data (where the number of passes is equal to the number of keys), which is not feasible when the number of keys is not very small.
Some of the use cases I am considering are, given a very large (tabular) dataset:
We wish to execute some iterative algorithm on each of the different columns separately, for example some automated feature extraction. A natural way to do so would have been to decompose the dataset such that each of the columns is represented by a separate RDD.
We wish to decompose the dataset into disjoint datasets (for example a dataset per day) and execute some machine learning modeling on each of them.
I think the best option is to write out the data in a single pass to one file per key (see Write to multiple outputs by key Spark - one Spark job) and then load the per-key files into one RDD each; a sketch follows.
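If the data fits the DataFrame API, one way to do the single-pass per-key write is the writer's partitionBy. The paths and column name below are made up, and the linked question covers a pure-RDD variant with a custom Hadoop OutputFormat:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("split-by-key").getOrCreate()
import spark.implicits._

val df = Seq(("A", 1), ("B", 2), ("A", 3)).toDF("key", "value")

// One pass over the data, one directory per key value:
df.write.mode("overwrite").partitionBy("key").parquet("/tmp/by-key")

// Later, each group can be loaded on its own (use .rdd to get back to the RDD API):
val groupA = spark.read.parquet("/tmp/by-key/key=A")
println(groupA.count())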
