What data structure is a DataFrame in Spark? - apache-spark

This is a follow-up to my previous question.
A Row is an ordered set of key-value pairs. A DataFrame is a collection of Rows.
What data structure is a DataFrame, actually? Is it a list, a set, or some other "collection"? Is it a relation as in SQL?

It's an abstraction over an RDD[Row] (a Dataset[Row] in Spark 2), with a defined schema set through a series of Column classes.
Is it a list, set, or other "collection"?
Not in the Java sense of those words. It is similar to how an RDD is none of those, but rather a "lazy collection".
Is it a relation as in SQL?
You're welcome to run Spark SQL over a DataFrame, but it's a table. Relations are optional.
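To make that concrete, here is a minimal spark-shell style sketch (the column names and sample data are invented for illustration) showing that in Spark 2 a DataFrame is simply a type alias for Dataset[Row] with a schema attached, and that you can register it and query it with Spark SQL:

// Assumes a spark-shell session (SparkSession available as `spark`).
import org.apache.spark.sql.{Dataset, Row}
import spark.implicits._

val df = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")

// In Spark 2+, DataFrame is just an alias for Dataset[Row].
val rows: Dataset[Row] = df
df.printSchema()                      // the schema defined through Column types

df.createOrReplaceTempView("people")  // expose it as a "table"
spark.sql("SELECT name FROM people WHERE age > 26").show()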

Although a DataFrame is an abstraction over an RDD, the internal representation of a DataFrame is quite different from that of an RDD.
An RDD is represented as Java objects and uses the JVM for all operations, whereas a DataFrame is represented in Tungsten's binary format.
Here is an excellent article which elaborates on how DataFrames are represented in Tungsten.
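One way to see this difference from the user side is to compare an RDD transformation with the equivalent DataFrame operation. In the sketch below (spark-shell style, data invented), the RDD version is an opaque JVM closure over Scala objects, while the DataFrame version is built from Catalyst expressions whose optimized physical plan you can inspect with explain():

// Assumes a spark-shell session (SparkSession available as `spark`).
import spark.implicits._

val rdd = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2)))
val rddResult = rdd.map { case (k, v) => (k, v + 1) }   // plain Scala objects, plain lambda

val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")
df.select($"key", $"value" + 1).explain()               // prints the optimized physical plan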

Related

Apache Spark: How deep is the comparison of rows in RDD or DF

I want to understand the behavior of DF.intersect().
The question came to mind especially when we have complex Rows with complex fields (a deep tree).
If we are talking about the DataFrame intersect transformation, then, according to the Dataset documentation and source (quoted below), the comparison is done directly on the encoded content, which is as deep as it can possibly go.
def intersect(other: Dataset[T]): Dataset[T]
Returns a new Dataset containing rows only in both this Dataset and another Dataset. This is equivalent to INTERSECT in SQL.
Since: 1.6.0
Note: Equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T.
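A small sketch (the case classes and values are invented) illustrating that intersect matches whole rows, including nested fields:

// Assumes a spark-shell session (SparkSession available as `spark`).
import spark.implicits._

case class Address(city: String, zip: String)
case class Person(name: String, address: Address)

val left  = Seq(Person("alice", Address("Oslo", "0150")),
                Person("bob",   Address("Bergen", "5003"))).toDS()
val right = Seq(Person("alice", Address("Oslo", "0150")),
                Person("bob",   Address("Bergen", "9999"))).toDS()

// Only "alice" survives: "bob" differs in a nested field (zip), and the
// comparison happens on the encoded representation of the whole row.
left.intersect(right).show()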

Relation between RDD and Dataset/Dataframe from a technical point of view

I am trying to understand whether there is a relationship between RDDs and DataFrames/Datasets from a technical point of view. RDDs are often described as the fundamental data abstraction in Spark. In my understanding this would mean that DataFrames/Datasets should also be based on them. In the original Spark SQL paper, figures 1 & 3 point to this connection. However, I haven't found any documentation on what this connection looks like (if it exists at all).
So my question: are DataFrames/Datasets based on RDDs, or are these two concepts independent?
DataFrames and Datasets are based on RDDs, although this is somewhat hidden. DataFrames and Datasets mostly live in the spark-sql project, whereas RDDs live in spark-core.
Here is the technical point of view on how a DataFrame, which is a Dataset[Row], and an RDD are linked: a DataFrame has a QueryExecution which controls how the whole SQL execution behaves. When the plan is executed by the engine, the output is an internal RDD of rows, lazy val toRdd: RDD[InternalRow] = executedPlan.execute(). Given that RDD and a schema, the DataFrame is formed.
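That link is also visible from the public API. A quick spark-shell style sketch (the sample DataFrame is made up) showing both sides: a Dataset can hand back its underlying RDD, and its QueryExecution exposes the internal RDD[InternalRow] mentioned above:

// Assumes a spark-shell session (SparkSession available as `spark`).
import spark.implicits._
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.InternalRow

val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

// Public API: the DataFrame deserialized back into an RDD of Row objects.
val asRdd: RDD[Row] = df.rdd

// Internal API: the RDD[InternalRow] produced by the executed physical plan
// (this is the `toRdd` lazy val referenced above).
val internal: RDD[InternalRow] = df.queryExecution.toRdd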

Apache Spark MLlib ALS. Duplicate user-item pairs

I am using the Spark MLlib ALS function to build a recommendation system. The function accepts as input an RDD comprising rows of the form (user_id, item_id, rating).
I would like to know what happens when the function sees two tuples with the same user_id and item_id. Does the function overwrite or average the values?
I went through the official documentation but did not find any clue.
Many thanks
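The question is left open here, but one way to avoid depending on whatever ALS does internally is to pre-aggregate duplicate (user, item) pairs yourself before training. A minimal sketch (the sample ratings and the choice to average are my own assumptions, not documented ALS behaviour):

// Assumes a spark-shell session (SparkContext available as `sc`).
import org.apache.spark.mllib.recommendation.{ALS, Rating}

val raw = sc.parallelize(Seq(
  Rating(1, 10, 4.0),
  Rating(1, 10, 2.0),   // duplicate (user, item) pair
  Rating(2, 10, 5.0)
))

// Average the ratings per (user, item) pair so the input to ALS is unambiguous.
val deduped = raw
  .map(r => ((r.user, r.product), (r.rating, 1L)))
  .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
  .map { case ((user, item), (sum, count)) => Rating(user, item, sum / count) }

val model = ALS.train(deduped, 10, 10)   // rank = 10, iterations = 10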

How does spark handle missing values?

Apache Spark supports sparse data.
For example, we can use MLUtils.loadLibSVMFile(...) to load data into an RDD.
I was wondering how Spark deals with those missing values.
Spark creates an RDD of LabeledPoints, where each labeled point has a label and a vector of features. Note that this is a Spark Vector, which does support sparse elements (currently sparse vectors are represented by an array of the indices of the non-null entries and a second array of doubles holding the corresponding values).
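A short sketch of what that looks like in practice (the file path is just a placeholder for your own LibSVM file):

// Assumes a spark-shell session (SparkContext available as `sc`).
import org.apache.spark.mllib.linalg.{SparseVector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils

// Each line of a LibSVM file lists only the non-zero features, so the loader
// produces LabeledPoints whose feature vectors are sparse.
val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")

val first: LabeledPoint = data.first()
val sv = first.features.asInstanceOf[SparseVector]
println(sv.size)                    // total number of features
println(sv.indices.mkString(","))   // indices of the non-zero entries
println(sv.values.mkString(","))    // the corresponding values

// The same structure built by hand: a length-5 vector with entries at 1 and 3.
val manual = Vectors.sparse(5, Array(1, 3), Array(0.5, 2.0))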

Decomposing Spark RDDs

In Spark, it is possible to compose multiple RDDs into one, using zip, union, join, etc.
Is it possible to decompose an RDD efficiently? Namely, without performing multiple passes on the original RDD? What I am looking for is something similar to:
val rdd: RDD[T] = ...
val grouped: Map[K, RDD[T]] = rdd.specialGroupBy(...)
One of the strengths of RDDs is that they enable performing iterative computations efficiently. In some (machine learning) use cases I encountered, we need to perform iterative algorithms on each of the groups separately.
The current possibilities I am aware of are:
GroupBy: groupBy returns an RDD[(K, Iterable[T])], which does not give you the RDD benefits on the group itself (the iterable).
Aggregations: such as reduceByKey, foldByKey, etc. These perform only one "iteration" over the data and do not have the expressive power for implementing iterative algorithms.
Creating separate RDDs using the filter method and multiple passes on the data (where the number of passes is equal to the number of keys), which is not feasible when the number of keys is not very small.
Some of the use cases I am considering are, given a very large (tabular) dataset:
We wish to execute some iterative algorithm on each of the different columns separately, for example some automated feature extraction. A natural way to do so would have been to decompose the dataset such that each of the columns is represented by a separate RDD.
We wish to decompose the dataset into disjoint datasets (for example a dataset per day) and execute some machine learning modeling on each of them.
I think the best option is to write out the data in a single pass to one file per key (see Write to multiple outputs by key Spark - one Spark job) and then load the per-key files into one RDD each.
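The linked answer does this at the RDD/Hadoop level; a DataFrame-based sketch of the same idea (the key column, values, and output path are made up for illustration) could look like this:

// Assumes a spark-shell session (SparkSession available as `spark`).
import spark.implicits._

val df = Seq(("2017-01-01", 1.0), ("2017-01-01", 2.0), ("2017-01-02", 3.0))
  .toDF("day", "value")

// Single pass over the data: one directory per distinct key.
df.write.partitionBy("day").parquet("/tmp/by_day")

// Each group can now be loaded back separately (as a DataFrame, or via .rdd)
// without rescanning the full dataset.
val days = df.select("day").distinct().as[String].collect()
val perDay = days.map(d => d -> spark.read.parquet(s"/tmp/by_day/day=$d")).toMap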
