Spark SQL with different data sources - apache-spark

Is it possible to create data frames from 2 different sources and perform operations on them?
For example,
df1 = <create from a file or folder from S3>
df2 = <create from a hive table>
df1.join(df2).where("df1Key" === "df2Key")
If this is possible, what are the implications in doing so?

Yes, it is possible to read from different data sources and perform operations across them.
In fact, many applications have this kind of requirement.
df1.join(df2).where($"df1Key" === $"df2Key")
Logically this is a Cartesian join followed by a filter, although the Catalyst optimizer can usually push the predicate into the join.
df1.join(df2, $"df1Key" === $"df2Key")
This states the join condition directly and should produce the same output.
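A minimal sketch of the second form, assuming Spark 2.x or later running in local mode. The two in-memory DataFrames are stand-ins for the file/S3 read and the Hive table from the question, and the column names payload and attr are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object JoinSketch {
  // Joins on an explicit condition, so no Cartesian product has to be planned.
  def joinOnKeys(df1: DataFrame, df2: DataFrame): DataFrame =
    df1.join(df2, df1("df1Key") === df2("df2Key"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("join-sketch")
      .getOrCreate()
    import spark.implicits._

    // Stand-ins: in a real job df1 would come from spark.read and df2 from a Hive table.
    val df1 = Seq((1, "a"), (2, "b"), (3, "c")).toDF("df1Key", "payload")
    val df2 = Seq((2, "x"), (3, "y"), (4, "z")).toDF("df2Key", "attr")

    // Only keys 2 and 3 exist on both sides.
    joinOnKeys(df1, df2).show()
    spark.stop()
  }
}
```

Once both inputs are DataFrames, the join code does not need to know where either one came from.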

A DataFrame is a source-independent abstraction. I would encourage you to read the original paper on RDDs and the wiki.
The abstraction keeps track of the location of the data and the underlying DAG of operations. The DataFrame API adds a schema on top of an RDD.
You can build a DataFrame from any source, and they are all homogenized to expose the same API. The DataFrame API provides a DataFrameReader interface that any underlying source can implement to create a DataFrame on top of it. Here is another example: the Cassandra connector for DataFrames.
One caveat is that the speed of data retrieval varies between sources. For example, operations on a DataFrame built on top of HDFS will probably be faster than on one built on top of S3. Nonetheless, you can perform any join on DataFrames created from different sources.

Related

When should we go for Spark-sql and when should we go for Spark RDD

In which scenarios should we prefer Spark RDDs to write a solution, and in which should we go for Spark SQL? I know Spark SQL gives better performance and works best with structured and semi-structured data. What other factors do we need to take into consideration when choosing between Spark RDDs and Spark SQL?
I don't see many reasons to still use RDDs.
Assuming you are using a JVM-based language, you can use Dataset, which is a mix of Spark SQL and RDD (DataFrame == Dataset[Row]). According to the Spark documentation:
Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine.
The problem is that Python does not support Dataset, so you will use RDDs and lose the Spark SQL optimizations when you work with unstructured data.
I found DFs easier to use than DSs; the latter are still subject to development imho. The comment on pyspark is indeed still relevant.
RDDs are still handy for zipWithIndex, to put ascending, contiguous sequence numbers on items.
DFs / DSs have a columnar store and better Catalyst (optimizer) support.
Also, many things with RDDs are painful, like a JOIN requiring key-value pairs and a multi-step join when you need to JOIN more than 2 tables. They are legacy. The problem is that the internet is full of legacy material, and thus RDD jazz.
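The zipWithIndex use mentioned above can be sketched as follows. This is only an illustration on a tiny in-memory RDD, assuming a local SparkSession; the object and method names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object ZipWithIndexSketch {
  // Pairs every element with a contiguous 0-based index that follows RDD order.
  def indexed(spark: SparkSession): Array[(String, Long)] = {
    val items = spark.sparkContext.parallelize(Seq("a", "b", "c", "d"), numSlices = 2)
    items.zipWithIndex().collect()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("zip-with-index-sketch")
      .getOrCreate()
    indexed(spark).foreach(println)
    spark.stop()
  }
}
```

Note that zipWithIndex has to run a Spark job to count the elements per partition when the RDD has more than one partition, so it is not free on large inputs.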
RDD
An RDD is a collection of data distributed across the cluster, and it handles both unstructured and structured data. It is typically manipulated in a functional style.
DF
Data frames are basically two-dimensional arrays of objects that define the data in rows and columns. They are similar to relational tables in a database. A data frame handles only structured data.

Spark SQL Update/Delete

Currently, I am working on a project using pySpark that reads in a few Hive tables, stores them as dataframes, and I have to perform a few updates/filters on them. I am avoiding using Spark syntax at all costs to create a framework that will only take SQL in a parameter file that will be run using my pySpark framework.
Now the problem is that I have to perform UPDATE/DELETE queries on my final dataframe. Are there any possible workarounds for performing these operations on my dataframe?
Thank you so much!
A DataFrame is immutable; you cannot change it, so you cannot update or delete in place.
If you want to "delete", there is the .filter option (it creates a new DF that excludes records based on the condition you apply in the filter).
If you want to "update", the closest equivalent is .map (or .withColumn with a conditional expression), where you "modify" each record and the values end up in a new DF. Keep in mind that the function is applied to every record in the DF.
Another thing to keep in mind: if you load data into a DF from some source (e.g. a Hive table) and perform some operations, the updated data won't be reflected in your source data. The changes exist only in the DF until you write that data back.
So you cannot treat a DF like a SQL table for those operations. Depending on your requirements, you need to analyze whether Spark is a solution for your specific problem.
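Since the question asks for a SQL-only framework, here is one hedged way to express the two workarounds as plain SELECT statements over a temporary view. It is written in Scala to match the other snippets here; the same SQL strings could be passed to spark.sql from pySpark. The view name items and the columns id, price and status are made up for this sketch.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object UpdateDeleteSketch {
  // "DELETE FROM t WHERE status = 'obsolete'" becomes a SELECT that keeps everything else.
  // Caveat: rows whose status is NULL are also dropped by this predicate.
  def deleteObsolete(spark: SparkSession, view: String): DataFrame =
    spark.sql(s"SELECT * FROM $view WHERE status <> 'obsolete'")

  // "UPDATE t SET price = price * 2 WHERE id = 1" becomes a CASE WHEN projection.
  def doublePriceOfId1(spark: SparkSession, view: String): DataFrame =
    spark.sql(
      s"""SELECT id,
         |       CASE WHEN id = 1 THEN price * 2 ELSE price END AS price,
         |       status
         |FROM $view""".stripMargin)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("update-delete-sketch")
      .getOrCreate()
    import spark.implicits._

    Seq((1, 10.0, "active"), (2, 20.0, "obsolete"))
      .toDF("id", "price", "status")
      .createOrReplaceTempView("items")

    // Each step yields a new DataFrame; register it again to chain further SQL.
    deleteObsolete(spark, "items").createOrReplaceTempView("items_kept")
    doublePriceOfId1(spark, "items_kept").show()
    spark.stop()
  }
}
```

Nothing here touches the source table; the result would still have to be written out explicitly if the change should persist.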

Extract and analyze data from JSON - Hadoop vs Spark

I'm trying to learn the whole open source big data stack, and I've started with HDFS, Hadoop MapReduce and Spark. I'm more or less limited with MapReduce and Spark (SQL?) for "ETL", HDFS for storage, and no other limitation for other things.
I have a situation like this:
My Data Sources
Data Source 1 (DS1): Lots of data - totaling to around 1TB. I have IDs (let's call them ID1) inside each row - used as a key. Format: 1000s of JSON files.
Data Source 2 (DS2): Additional "metadata" for data source 1. I have IDs (let's call them ID2) inside each row - used as a key. Format: Single TXT file
Data Source 3 (DS3): Mapping between Data Source 1 and 2. Only pairs of ID1, ID2 in CSV files.
My workspace
I currently have a VM with enough data space, about 128GB of RAM and 16 CPUs to handle my problem (the whole project is for research, not production use). I have CentOS 7 and Cloudera 6.x installed. Currently, I'm using HDFS, MapReduce and Spark.
The task
I need only some attributes (ID and a few strings) from Data Source 1. My guess is that it comes to less than 10% in data size.
I need to connect ID1s from DS3 (pairs: ID1, ID2) to IDs in DS1 and ID2s from DS3 (pairs: ID1, ID2) to IDs in DS2.
I need to add attributes from DS2 (using "mapping" from the previous bullet) to my extracted attributes from DS1
I need to make some "queries", like:
Find the most used words by years
Find the most common words, used by a certain author
Find the most common words, used by a certain author, on a yearly basis
etc.
I need to visualize data (i.e. wordclouds, histograms, etc.) at the end.
My questions:
Which tool to use to extract data from JSON files the most efficient way? MapReduce or Spark (SQL?)?
I have arrays inside JSON. I know the explode function in Spark can transpose my data. But what is the best way to go here? Is it best to extract IDs from DS1, put the exploded data next to them, and write them to new files? Or is it better to combine everything? How to achieve this - Hadoop, Spark?
My current idea was to create something like this:
Extract attributes needed (except arrays) from DS1 with Spark and write them to CSV files.
Extract attributes needed (exploded arrays only + IDs) from DS1 with Spark and write them to CSV files - each exploded attribute to own file(s).
This means I have extracted all the data I need, and I can easily connect them with only one ID. I then wanted to make queries for specific questions and run MapReduce jobs.
The question: Is this a good idea? If not, what can I do better? Should I insert data into a database? If yes, which one?
Thanks in advance!
Thanks for asking! Having been a Big Data developer for the last 1.5 years, with experience in both MR and Spark, I think I can guide you in the right direction.
The final goals you want to achieve can be reached with either MapReduce or Spark. For visualization you can use Apache Zeppelin, which can run on top of your final data.
Spark jobs are memory-hungry: Spark keeps intermediate data in memory (RAM) where it can, and typically only the final result is written to HDFS. MapReduce, on the other hand, uses less memory and writes intermediate stage results to HDFS, which means more I/O operations and more time.
You can use Spark's DataFrame feature. You can load data directly into a DataFrame from structured data (it can also be a plain-text file), which gives you the required data in a tabular format. You can write the DataFrame to a plain-text file, or store it in a Hive table from which you can visualize the data. With MapReduce, on the other hand, you would first have to store the data in a Hive table, then write Hive operations to manipulate it, and store the final data in another Hive table. Writing native MapReduce jobs can be very tedious, so I would suggest refraining from that option.
In the end, I would suggest using Spark as the processing engine (128GB and 16 cores is enough for Spark) to get your final result as soon as possible.
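As a rough illustration of the DataFrame route for the JSON arrays in the question, the sketch below explodes an array column and counts words per year. The field names id, year and words are assumptions, not taken from the real data, and the two inline JSON strings stand in for the JSON files (reading JSON from a Dataset[String] assumes Spark 2.2 or later).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

object ExplodeSketch {
  // One output row per (id, year, word), then a count per (year, word).
  def wordsPerYear(docs: DataFrame): DataFrame =
    docs
      .select(col("id"), col("year"), explode(col("words")).as("word"))
      .groupBy("year", "word")
      .count()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("explode-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical records; a real job would point spark.read.json at the files in HDFS.
    val json = Seq(
      """{"id": "d1", "year": 2019, "words": ["spark", "hdfs", "spark"]}""",
      """{"id": "d2", "year": 2020, "words": ["spark"]}"""
    ).toDS()
    val docs = spark.read.json(json)

    wordsPerYear(docs).orderBy(col("count").desc).show()
    spark.stop()
  }
}
```

Selecting only the needed columns before exploding keeps the intermediate data small, which matters when only a fraction of each record is used.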

Real difference between RDD and DataFrame/Dataset [duplicate]

I'm just wondering what is the difference between an RDD and DataFrame (Spark 2.0.0 DataFrame is a mere type alias for Dataset[Row]) in Apache Spark?
Can you convert one to the other?
First, DataFrame evolved from SchemaRDD.
Yes, conversion between DataFrame and RDD is absolutely possible.
Below are some sample code snippets.
df.rdd is RDD[Row]
Below are some of the options for creating a DataFrame.
1) rdd.toDF converts an RDD of case classes or tuples to a DataFrame (requires import spark.implicits._).
2) Using createDataFrame of the SQL context or SparkSession
val df = spark.createDataFrame(rddOfRow, schema)
where schema can come from one of the options below, as described in a nice SO post.
From scala case class and scala reflection api
import org.apache.spark.sql.catalyst.ScalaReflection
val schema = ScalaReflection.schemaFor[YourScalacaseClass].dataType.asInstanceOf[StructType]
OR using Encoders
import org.apache.spark.sql.Encoders
val mySchema = Encoders.product[MyCaseClass].schema
The schema can also be created using StructType and StructField:
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}
val schema = new StructType()
.add(StructField("id", StringType, true))
.add(StructField("col1", DoubleType, true))
.add(StructField("col2", DoubleType, true)) // etc.
In fact, there are now 3 Apache Spark APIs:
RDD API :
The RDD (Resilient Distributed Dataset) API has been in Spark since
the 1.0 release.
The RDD API provides many transformation methods, such as map(),
filter(), and flatMap() for performing computations on the data. Each
of these methods results in a new RDD representing the transformed
data. However, these methods are just defining the operations to be
performed and the transformations are not performed until an action
method is called. Examples of action methods are collect() and
saveAsObjectFile().
RDD Example:
rdd.filter(_.age > 21) // transformation
.map(_.last) // transformation
.saveAsObjectFile("over21.bin") // action
Example: Filter by attribute with RDD
rdd.filter(_.age > 21)
DataFrame API
Spark 1.3 introduced a new DataFrame API as part of the Project
Tungsten initiative which seeks to improve the performance and
scalability of Spark. The DataFrame API introduces the concept of a
schema to describe the data, allowing Spark to manage the schema and
only pass data between nodes, in a much more efficient way than using
Java serialization.
The DataFrame API is radically different from the RDD API because it
is an API for building a relational query plan that Spark’s Catalyst
optimizer can then execute. The API is natural for developers who are
familiar with building query plans
Example SQL style :
df.filter("age > 21");
Limitations :
Because the code refers to data attributes by name, it is not possible for the compiler to catch any errors. If attribute names are incorrect, the error will only be detected at runtime, when the query plan is created.
Another downside with the DataFrame API is that it is very scala-centric and while it does support Java, the support is limited.
For example, when creating a DataFrame from an existing RDD of Java objects, Spark’s Catalyst optimizer cannot infer the schema and assumes that any objects in the DataFrame implement the scala.Product interface. Scala case class works out the box because they implement this interface.
Dataset API
The Dataset API, released as an API preview in Spark 1.6, aims to
provide the best of both worlds; the familiar object-oriented
programming style and compile-time type-safety of the RDD API but with
the performance benefits of the Catalyst query optimizer. Datasets
also use the same efficient off-heap storage mechanism as the
DataFrame API.
When it comes to serializing data, the Dataset API has the concept of
encoders which translate between JVM representations (objects) and
Spark’s internal binary format. Spark has built-in encoders which are
very advanced in that they generate byte code to interact with
off-heap data and provide on-demand access to individual attributes
without having to de-serialize an entire object. Spark does not yet
provide an API for implementing custom encoders, but that is planned
for a future release.
Additionally, the Dataset API is designed to work equally well with
both Java and Scala. When working with Java objects, it is important
that they are fully bean-compliant.
Example Dataset API lambda style :
dataset.filter(_.age < 21);
Evaluation differences between DataFrame & Dataset:
Catalyst-level flow (from the "Demystifying DataFrame and Dataset" presentation at Spark Summit)
Further reading... databricks article - A Tale of Three Apache Spark APIs: RDDs vs DataFrames and Datasets
A DataFrame is defined well with a google search for "DataFrame definition":
A data frame is a table, or two-dimensional array-like structure, in
which each column contains measurements on one variable, and each row
contains one case.
So, a DataFrame has additional metadata due to its tabular format, which allows Spark to run certain optimizations on the finalized query.
An RDD, on the other hand, is merely a Resilient Distributed Dataset: more of a black box of data that cannot be optimized, because the operations that can be performed against it are not as constrained.
However, you can go from a DataFrame to an RDD via its rdd method, and you can go from an RDD to a DataFrame (if the RDD is in a tabular format) via the toDF method
In general it is recommended to use a DataFrame where possible due to the built in query optimization.
Apache Spark provides three types of APIs:
RDD
DataFrame
Dataset
Here is the APIs comparison between RDD, Dataframe and Dataset.
RDD
The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel.
RDD Features:-
Distributed collection:
RDD uses MapReduce operations which is widely adopted for processing and generating large datasets with a parallel, distributed algorithm on a cluster. It allows users to write parallel computations, using a set of high-level operators, without having to worry about work distribution and fault tolerance.
Immutable: RDDs are composed of a collection of records which are partitioned. A partition is the basic unit of parallelism in an RDD, and each partition is one logical division of data which is immutable and created through some transformations on existing partitions. Immutability helps to achieve consistency in computations.
Fault tolerant:
If we lose some partition of an RDD, we can replay the transformations on that partition from its lineage to achieve the same computation, rather than replicating data across multiple nodes. This characteristic is the biggest benefit of RDDs because it saves a lot of effort in data management and replication and thus achieves faster computations.
Lazy evaluations: All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset . The transformations are only computed when an action requires a result to be returned to the driver program.
Functional transformations:
RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.
Data processing formats:
It can easily and efficiently process data which is structured as well as unstructured data.
Programming Languages supported:
RDD API is available in Java, Scala, Python and R.
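The lazy-evaluation point in the list above can be observed with a small experiment. The sketch below is only a demonstration, assuming a local SparkSession: it counts how many times the map function runs before and after an action by using an accumulator. Accumulators updated inside transformations can over-count if tasks are retried, so this is not a pattern for production metrics.

```scala
import org.apache.spark.sql.SparkSession

object LazySketch {
  // Returns (map calls seen before the action, map calls seen after the action).
  def run(spark: SparkSession): (Long, Long) = {
    val sc = spark.sparkContext
    val calls = sc.longAccumulator("map calls")

    // Transformation only: this line records the lineage and runs nothing.
    val doubled = sc.parallelize(1 to 5).map { x => calls.add(1L); x * 2 }
    val before = calls.value.longValue

    // Action: this triggers the job, so the map function finally executes.
    doubled.collect()
    val after = calls.value.longValue

    (before, after)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("lazy-sketch")
      .getOrCreate()
    println(run(spark))
    spark.stop()
  }
}
```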
RDD Limitations:-
No inbuilt optimization engine:
When working with structured data, RDDs cannot take advantage of Spark’s advanced optimizers, including the Catalyst optimizer and the Tungsten execution engine. Developers need to optimize each RDD based on its attributes.
Handling structured data:
Unlike DataFrames and Datasets, RDDs don’t infer the schema of the ingested data and require the user to specify it.
Dataframes
Spark introduced Dataframes in Spark 1.3 release. Dataframe overcomes the key challenges that RDDs had.
A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a R/Python Dataframe. Along with Dataframe, Spark also introduced catalyst optimizer, which leverages advanced programming features to build an extensible query optimizer.
Dataframe Features:-
Distributed collection of Row Object:
A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database, but with richer optimizations under the hood.
Data Processing:
Processing structured and unstructured data formats (Avro, CSV, elastic search, and Cassandra) and storage systems (HDFS, HIVE tables, MySQL, etc). It can read and write from all these various datasources.
Optimization using catalyst optimizer:
It powers both SQL queries and the DataFrame API. Dataframe use catalyst tree transformation framework in four phases,
1.Analyzing a logical plan to resolve references
2.Logical plan optimization
3.Physical planning
4.Code generation to compile parts of the query to Java bytecode.
Hive Compatibility:
Using Spark SQL, you can run unmodified Hive queries on your existing Hive warehouses. It reuses Hive frontend and MetaStore and gives you full compatibility with existing Hive data, queries, and UDFs.
Tungsten:
Tungsten provides a physical execution backend which explicitly manages memory and dynamically generates bytecode for expression evaluation.
Programming Languages supported:
Dataframe API is available in Java, Scala, Python, and R.
Dataframe Limitations:-
Compile-time type safety:
As discussed, the Dataframe API does not support compile-time safety, which limits you from manipulating data when the structure is not known. The following example compiles without problems. However, you will get a runtime exception when executing this code.
Example:
case class Person(name : String , age : Int)
val dataframe = sqlContext.read.json("people.json")
dataframe.filter("salary > 10000").show
=> throws Exception : cannot resolve 'salary' given input age , name
This is challenging, especially when you are working with several transformation and aggregation steps.
Cannot operate on domain Object (lost domain object):
Once you have transformed a domain object into a dataframe, you cannot regenerate it from the dataframe. In the following example, once we have created personDF from personRDD, we won’t be able to recover the original RDD of the Person class (RDD[Person]).
Example:
case class Person(name : String , age : Int)
val personRDD = sc.makeRDD(Seq(Person("A",10),Person("B",20)))
val personDF = sqlContext.createDataFrame(personRDD)
personDF.rdd // returns RDD[Row] , does not returns RDD[Person]
Datasets API
Dataset API is an extension to DataFrames that provides a type-safe, object-oriented programming interface. It is a strongly-typed, immutable collection of objects that are mapped to a relational schema.
At the core of the Dataset API is a new concept called an encoder, which is responsible for converting between JVM objects and the tabular representation. The tabular representation is stored using Spark's internal Tungsten binary format, allowing for operations on serialized data and improved memory utilization. Spark 1.6 comes with support for automatically generating encoders for a wide variety of types, including primitive types (e.g. String, Integer, Long), Scala case classes, and Java Beans.
Dataset Features:-
Provides best of both RDD and Dataframe:
RDD (functional programming, type safe), DataFrame (relational model, query optimization, Tungsten execution, sorting and shuffling)
Encoders:
With the use of Encoders, it is easy to convert any JVM object into a Dataset, allowing users to work with both structured and unstructured data unlike Dataframe.
Programming Languages supported:
Datasets API is currently only available in Scala and Java. Python and R are currently not supported in version 1.6. Python support is slated for version 2.0.
Type Safety:
The Datasets API provides compile-time safety, which was not available in Dataframes. In the example below, we can see how a Dataset can operate on domain objects with compiled lambda functions.
Example:
case class Person(name : String , age : Int)
val personRDD = sc.makeRDD(Seq(Person("A",10),Person("B",20)))
val personDF = sqlContext.createDataFrame(personRDD)
val ds:Dataset[Person] = personDF.as[Person]
ds.filter(p => p.age > 25)
ds.filter(p => p.salary > 25)
// error : value salary is not a member of person
ds.rdd // returns RDD[Person]
Interoperable: Datasets allows you to easily convert your existing RDDs and Dataframes into datasets without boilerplate code.
Datasets API Limitation:-
Requires type casting to String:
Querying the data from datasets currently requires us to specify the fields in the class as a string. Once we have queried the data, we are forced to cast column to the required data type. On the other hand, if we use map operation on Datasets, it will not use Catalyst optimizer.
Example:
ds.select(col("name").as[String], $"age".as[Int]).collect()
No support for Python and R: As of release 1.6, Datasets only support Scala and Java. Python support will be introduced in Spark 2.0.
The Datasets API brings several advantages over the existing RDD and Dataframe APIs, with better type safety and functional programming. With the type-casting requirements in the API, however, you would still not get the required type safety, and your code will be brittle.
All(RDD, DataFrame, and DataSet) in one picture.
RDD
RDD is a fault-tolerant collection of elements that can be operated on in parallel.
DataFrame
DataFrame is a Dataset organized into named columns. It is
conceptually equivalent to a table in a relational database or a data
frame in R/Python, but with richer optimizations under the hood.
Dataset
Dataset is a distributed collection of data. Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs
(strong typing, ability to use powerful lambda functions) with the
benefits of Spark SQL’s optimized execution engine.
Note:
A Dataset of Rows (Dataset[Row]) in Scala/Java is often referred to as a DataFrame.
Nice comparison of all of them with a code snippet.
Q: Can you convert one to the other like RDD to DataFrame or vice-versa?
Yes, both are possible
1. RDD to DataFrame with .toDF()
import spark.implicits._ // needed for .toDF on an RDD of tuples
val tuplesRdd: RDD[(String, Double, Double)] = sc.parallelize(
Seq(
("first", 2.0, 7.0),
("second", 3.5, 2.5),
("third", 7.0, 5.9)
)
)
val df = tuplesRdd.toDF("id", "val1", "val2")
df.show()
+------+----+----+
| id|val1|val2|
+------+----+----+
| first| 2.0| 7.0|
|second| 3.5| 2.5|
| third| 7.0| 5.9|
+------+----+----+
more ways: Convert an RDD object to Dataframe in Spark
2. DataFrame/DataSet to RDD with the .rdd method
val rowsRdd: RDD[Row] = df.rdd // DataFrame to RDD
DataFrame is weakly typed, so developers don't get the benefits of the type system. For example, let's say you want to read something from SQL and run some aggregation on it:
import org.apache.spark.sql.functions.{avg, max}
val people = sqlContext.read.parquet("...")
val department = sqlContext.read.parquet("...")
people.filter("age > 30")
.join(department, people("deptId") === department("id"))
.groupBy(department("name"), people("gender"))
.agg(avg(people("salary")), max(people("age")))
When you say people("deptId"), you're not getting back an Int or a Long; you're getting back a Column object which you need to operate on. In languages with a rich type system such as Scala, you end up losing all the type safety, which increases the number of run-time errors for things that could be discovered at compile time.
In contrast, Dataset[T] is typed. When you do:
val people: Dataset[People] = sqlContext.read.parquet("...").as[People]
you actually get back a Dataset of People objects, where deptId is an actual integral type and not a Column, so you can take advantage of the type system.
As of Spark 2.0, the DataFrame and Dataset APIs are unified, and DataFrame is a type alias for Dataset[Row].
Simply put, RDD is a core component, whereas DataFrame is an API introduced in Spark 1.3.0.
RDD
A collection of data partitions is called an RDD. An RDD must have a few properties, such as:
Immutable,
Fault Tolerant,
Distributed,
More.
An RDD can hold either structured or unstructured data.
DataFrame
DataFrame is an API available in Scala, Java, Python and R. It can process any type of structured and semi-structured data. By definition, a DataFrame is a distributed collection of data organized into named columns. Spark can optimize DataFrame operations far more easily than raw RDD operations.
You can process JSON data, Parquet data and Hive data together by using DataFrames.
val sample_DF = sqlContext.read.json("hdfs://localhost:9000/jsondata.json")
val sampleRDD = sample_DF.rdd
Here sample_DF is a DataFrame, and sampleRDD is the underlying RDD of Row objects.
Most of the answers are correct; I only want to add one point here.
In Spark 2.0 the two APIs (DataFrame + Dataset) are unified into a single API.
"Unifying DataFrame and Dataset: In Scala and Java, DataFrame and Dataset have been unified, i.e. DataFrame is just a type alias for Dataset of Row. In Python and R, given the lack of type safety, DataFrame is the main programming interface."
Datasets are similar to RDDs, however, instead of using Java serialization or Kryo they use a specialized Encoder to serialize the objects for processing or transmitting over the network.
Spark SQL supports two different methods for converting existing RDDs into Datasets. The first method uses reflection to infer the schema of an RDD that contains specific types of objects. This reflection based approach leads to more concise code and works well when you already know the schema while writing your Spark application.
The second method for creating Datasets is through a programmatic interface that allows you to construct a schema and then apply it to an existing RDD. While this method is more verbose, it allows you to construct Datasets when the columns and their types are not known until runtime.
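The two methods described above can be sketched side by side. This is a minimal illustration, assuming a local SparkSession; the Person case class and its fields are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical record type used only for this sketch.
case class Person(name: String, age: Int)

object SchemaSketch {
  // Method 1: the schema is inferred by reflection from the case class.
  def viaReflection(spark: SparkSession): DataFrame = {
    import spark.implicits._
    spark.sparkContext.parallelize(Seq(Person("A", 10), Person("B", 20))).toDF()
  }

  // Method 2: the schema is built programmatically and applied to an RDD[Row].
  def viaExplicitSchema(spark: SparkSession): DataFrame = {
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("age", IntegerType, nullable = false)))
    val rows = spark.sparkContext.parallelize(Seq(Row("A", 10), Row("B", 20)))
    spark.createDataFrame(rows, schema)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("schema-sketch")
      .getOrCreate()
    viaReflection(spark).printSchema()
    viaExplicitSchema(spark).printSchema()
    spark.stop()
  }
}
```

The second form is the one to reach for when column names and types are only known at runtime, for example when they are read from a configuration file.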
Here you can find an answer on RDD to DataFrame conversion:
How to convert rdd object to dataframe in spark
A DataFrame is equivalent to a table in RDBMS and can also be manipulated in similar ways to the "native" distributed collections in RDDs. Unlike RDDs, Dataframes keep track of the schema and support various relational operations that lead to more optimized execution.
Each DataFrame object represents a logical plan but because of their "lazy" nature no execution occurs until the user calls a specific "output operation".
A few insights from a usage perspective, RDD vs DataFrame:
RDDs are amazing, as they give us all the flexibility to deal with almost any kind of data: unstructured, semi-structured and structured. Since data is often not ready to fit into a DataFrame (even JSON), RDDs can be used to preprocess the data so that it fits into a DataFrame. RDDs are the core data abstraction in Spark.
Not all transformations that are possible on an RDD are possible on a DataFrame, and some have different names: for example, subtract() is for RDDs while except() is for DataFrames.
Since DataFrames are like relational tables, they follow strict rules for set/relational transformations. For example, to union two DataFrames, both must have the same number of columns with matching column data types. Column names can be different. These rules don't apply to RDDs. Here is a good tutorial explaining these facts.
There are performance gains when using DataFrames as others have already explained in depth.
Using DataFrames you don't need to pass the arbitrary function as you do when programming with RDDs.
You need the SQLContext/HiveContext to program DataFrames, as they live in the Spark SQL area of the Spark ecosystem, but for RDDs you only need SparkContext/JavaSparkContext, which live in the Spark Core libraries.
You can create a DF from an RDD if you can define a schema for it.
You can also convert a DF to an RDD and an RDD to a DF.
I hope it helps!
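The union and except() remarks in the list above can be checked with a short sketch. It assumes Spark 2.0 or later (where union replaced unionAll) and a local SparkSession; the column names are made up and deliberately differ between the two DataFrames.

```scala
import org.apache.spark.sql.SparkSession

object SetOpsSketch {
  // Returns (column names of the union, number of rows in a that are not in b).
  def run(spark: SparkSession): (Seq[String], Long) = {
    import spark.implicits._
    val a = Seq((1, "x"), (2, "y")).toDF("id", "label")
    val b = Seq((2, "y"), (3, "z")).toDF("key", "name") // other names, same types

    // union matches columns by position, so the result keeps a's column names.
    val unioned = a.union(b)
    // except is the DataFrame counterpart of RDD.subtract.
    val onlyInA = a.except(b)

    (unioned.columns.toSeq, onlyInA.count())
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("set-ops-sketch")
      .getOrCreate()
    println(run(spark))
    spark.stop()
  }
}
```

Because matching is positional, two DataFrames with the same columns in a different order would union without an error but with mixed-up data, so it is worth selecting columns in a fixed order first.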
A Dataframe is an RDD of Row objects, each representing a record. A Dataframe also knows the schema (i.e., data fields) of its rows. While Dataframes look like regular RDDs, internally they store data in a more efficient manner, taking advantage of their schema. In addition, they provide new operations not available on RDDs, such as the ability to run SQL queries. Dataframes can be created from external data sources, from the results of queries, or from regular RDDs.
Reference: Zaharia M., et al. Learning Spark (O'Reilly, 2015)
a. RDD (Spark1.0) —> Dataframe(Spark1.3) —> Dataset(Spark1.6)
b. RDD lets us decide HOW we want to do things, which limits the optimization Spark can do on the processing underneath. DataFrame/Dataset lets us decide WHAT we want to do and leaves it to Spark to decide how to do the computation.
c. Being in-memory JVM objects, RDDs involve the overhead of garbage collection and Java (or the slightly better Kryo) serialization, which are expensive when data grows. That degrades performance.
DataFrame offers a huge performance improvement over RDDs because of 2 powerful features:
Custom Memory management (aka Project Tungsten)
Optimized Execution Plans (aka Catalyst Optimizer)
Performance wise RDD -> Data Frame -> Dataset
d. Where Dataset (with Project Tungsten and the Catalyst optimizer) scores over DataFrame is one additional feature it has: Encoders.
Spark RDD (resilient distributed dataset) :
RDD is the core data abstraction API and is available since very first release of Spark (Spark 1.0). It is a lower-level API for manipulating distributed collection of data. The RDD APIs exposes some extremely useful methods which can be used to get very tight control over underlying physical data structure. It is an immutable (read only) collection of partitioned data distributed on different machines. RDD enables in-memory computation on large clusters to speed up big data processing in a fault tolerant manner.
To enable fault tolerance, RDD uses a DAG (Directed Acyclic Graph), which consists of a set of vertices and edges. The vertices and edges in the DAG represent the RDDs and the operations to be applied on them, respectively. The transformations defined on an RDD are lazy and execute only when an action is called.
Spark DataFrame :
Spark 1.3 introduced the DataFrame API, a new data abstraction. The DataFrame API organizes the data into named columns, like a table in a relational database. It enables programmers to define a schema on a distributed collection of data. Each row in a DataFrame is of object type Row. Like an SQL table, each column must have the same number of rows in a DataFrame. In short, a DataFrame is a lazily evaluated plan which specifies the operations that need to be performed on the distributed collection of data. A DataFrame is also an immutable collection.
Spark DataSet :
As an extension to the DataFrame API, Spark 1.6 introduced the Dataset API, which provides a strictly typed, object-oriented programming interface in Spark. It is an immutable, type-safe collection of distributed data. Like DataFrame, the Dataset API also uses the Catalyst engine to enable execution optimization.
Other Differences -
A DataFrame is an RDD that has a schema. You can think of it as a relational database table, in that each column has a name and a known type. The power of DataFrames comes from the fact that, when you create a DataFrame from a structured dataset (JSON, Parquet, ...), Spark is able to infer a schema: for JSON by making a pass over the data being loaded, and for Parquet by reading the schema stored in the files. Then, when calculating the execution plan, Spark can use the schema to apply substantially better optimizations.
Note that DataFrame was called SchemaRDD before Spark v1.3.0
Apache Spark – RDD, DataFrame, and DataSet
Spark RDD –
RDD stands for Resilient Distributed Dataset. It is a read-only,
partitioned collection of records. RDD is the fundamental data structure
of Spark. It allows a programmer to perform in-memory computations on
large clusters in a fault-tolerant manner, thus speeding up the task.
Spark Dataframe –
Unlike an RDD, the data is organized into named columns, like a table
in a relational database. It is an immutable distributed collection of
data. DataFrame in Spark allows developers to impose a structure onto
a distributed collection of data, allowing higher-level abstraction.
Spark Dataset –
Datasets in Apache Spark are an extension of the DataFrame API that
provides a type-safe, object-oriented programming interface. Dataset
takes advantage of Spark's Catalyst optimizer by exposing expressions
and data fields to a query planner.

Writing SQL vs using Dataframe APIs in Spark SQL

I am a newbie in the Spark SQL world. I am currently migrating my application's ingestion code, which ingests data into the Stage, Raw and Application layers in HDFS and does CDC (change data capture). It is currently written as Hive queries and executed via Oozie. This needs to migrate into a Spark application (current version 1.6). The other sections of the code will migrate later on.
In Spark SQL, I can create dataframes directly from tables in Hive and simply execute the queries as they are (like sqlContext.sql("my hive hql")). The other way would be to use the DataFrame API and rewrite the HQL that way.
What is the difference in these two approaches?
Is there any performance gain with using Dataframe APIs?
Some people suggested that there is an extra layer of SQL that the Spark core engine has to go through when using "SQL" queries directly, which may impact performance to some extent, but I didn't find any material substantiating that statement. I know the code would be much more compact with the DataFrame API, but when I have all my HQL queries handy, would it really be worth rewriting the complete code with the DataFrame API?
Thank You.
Question : What is the difference in these two approaches?
Is there any performance gain with using Dataframe APIs?
Answer :
There is a comparative study done by Hortonworks. source...
The gist is that, depending on the situation/scenario, each one is right.
There is no hard and fast rule to decide this. Please go through the points below.
RDDs, DataFrames, and SparkSQL (in fact 3 approaches, not just 2):
At its core, Spark operates on the concept of Resilient Distributed Datasets, or RDDs:
Resilient - if data in memory is lost, it can be recreated
Distributed - immutable distributed collection of objects in memory, partitioned across many data nodes in a cluster
Dataset - initial data can come from files, be created programmatically, come from data in memory, or come from another RDD
DataFrames API is a data abstraction framework that organizes your data into named columns:
Create a schema for the data
Conceptually equivalent to a table in a relational database
Can be constructed from many sources including structured data files, tables in Hive, external databases, or existing RDDs
Provides a relational view of the data for easy SQL like data manipulations and aggregations
Under the hood, it is an RDD of Rows
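The points above (a schema over named columns, construction from an existing RDD, and an RDD of Rows underneath) can be sketched like this. It is an illustration only, assuming Spark 2.0 or later in local mode; the column names and sample rows are made up.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object DataFrameFromRddSketch {
  // Builds a DataFrame from an RDD[Row] plus an explicit schema and
  // returns the column names together with the number of underlying rows.
  def describe(): (Seq[String], Long) = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("df-from-rdd-sketch")
      .getOrCreate()
    try {
      // A plain RDD of untyped Row objects (sample data is hypothetical).
      val rows = spark.sparkContext.parallelize(Seq(Row("widget", 2), Row("gadget", 10)))

      // The schema gives each column a name and a type.
      val schema = StructType(Seq(
        StructField("product", StringType, nullable = false),
        StructField("quantity", IntegerType, nullable = false)
      ))

      val df = spark.createDataFrame(rows, schema)
      df.printSchema()

      // Going back down: the DataFrame exposes its data as an RDD[Row].
      val backToRdd = df.rdd
      (df.columns.toSeq, backToRdd.count())
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit =
    println(describe())
}
```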
SparkSQL is a Spark module for structured data processing. You can interact with SparkSQL through:
SQL
DataFrames API
Datasets API
Test results:
RDDs outperformed DataFrames and SparkSQL for certain types of data processing
DataFrames and SparkSQL performed about the same, although for analysis involving aggregation and sorting SparkSQL had a slight advantage
Syntactically speaking, DataFrames and SparkSQL are much more intuitive than using RDDs
Took the best out of 3 for each test
Times were consistent and not much variation between tests
Jobs were run individually with no other jobs running
Random lookup against 1 order ID from 9 million unique order IDs
GROUP all the different products with their total COUNTS and SORT DESCENDING by product name
In your Spark SQL string queries, you won't learn of a syntax error until runtime (which could be costly), whereas with DataFrames syntax errors can be caught at compile time.
A couple more additions. DataFrames use the Tungsten memory representation, and the Catalyst optimizer is used by SQL as well as by DataFrames. With the Dataset API, you have more control over the actual execution plan than with SparkSQL.
If the query is lengthy, writing and running it efficiently as a single SQL string becomes difficult.
On the other hand, DataFrame, along with the Column API, helps the developer write compact code, which is ideal for ETL applications.
Also, all operations (e.g. greater than, less than, select, where, etc.) run through the DataFrame API build an Abstract Syntax Tree (AST), which is then passed to Catalyst for further optimization. (Source: Spark SQL Whitepaper, Section 3.3)
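To compare the two approaches yourself, a small sketch like the following expresses the same group/count/sort query once as a SQL string and once with the DataFrame API, then prints both plans with `explain()`. It assumes Spark 2.0 or later in local mode (in 1.6 the entry point would be `sqlContext` instead of `SparkSession`); the `orders` view and its columns are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, desc}

object SqlVsDataFrameSketch {
  // Runs the same aggregation through SQL and through the DataFrame API
  // and reports whether both return the same rows.
  def sameResult(): Boolean = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("sql-vs-dataframe-sketch")
      .getOrCreate()
    try {
      import spark.implicits._

      // Hypothetical sample data standing in for a Hive table.
      val orders = Seq((1L, "widget"), (2L, "gadget"), (3L, "widget"))
        .toDF("order_id", "product")
      orders.createOrReplaceTempView("orders")

      // SQL string: parsed and analyzed only when this line runs.
      val viaSql = spark.sql(
        "SELECT product, COUNT(*) AS total FROM orders GROUP BY product ORDER BY product DESC")

      // DataFrame API: method names are checked by the Scala compiler.
      val viaApi = orders
        .groupBy($"product")
        .agg(count("*").as("total"))
        .orderBy(desc("product"))

      // Both queries go through Catalyst; the printed plans can be compared by eye.
      viaSql.explain()
      viaApi.explain()

      viaSql.collect().toSeq == viaApi.collect().toSeq
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit =
    println(sameResult())
}
```

If the printed plans match for your own queries, the choice between the two styles is mostly about readability and compile-time checking rather than speed. Note that column names passed as strings are still resolved at runtime in both styles.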

Resources