My understanding is that one of the big changes between Spark 1.x and 2.x was the migration away from DataFrames toward the newer/improved Dataset objects.
However, in all the Spark 2.x docs I see DataFrames being used, not Datasets.
So I ask: in Spark 2.x are we still using DataFrames, or have the Spark folks just not updated their 2.x docs to use the newer + recommended Datasets?
DataFrames ARE Datasets, just a special kind of Dataset, namely Dataset[Row], i.e. untyped Datasets.
But it's true that even with Spark 2.x, many Spark users still use DataFrames, especially for fast prototyping (I'm one of them), because it's a very convenient API and many operations are (in my view) easier to do with DataFrames than with typed Datasets.
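For example, a minimal spark-shell sketch (the Person case class and the sample rows are made up for illustration):

import org.apache.spark.sql.{DataFrame, Dataset, Row}
import spark.implicits._

case class Person(name: String, age: Long)   // hypothetical record type

// A DataFrame is literally a type alias for Dataset[Row] in Spark 2.x.
val df: DataFrame = Seq(("Alice", 29L), ("Bob", 31L)).toDF("name", "age")
val untyped: Dataset[Row] = df               // compiles: they are the same type

// The same data viewed as a typed Dataset; .as[T] only adds compile-time types.
val people: Dataset[Person] = df.as[Person]

df.filter($"age" > 30).show()                // untyped Column expression
people.filter(_.age > 30).show()             // typed lambda, checked at compile time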
Apparently you can use both, but no one over at Spark has bothered updating the docs to show how to use Datasets, so I'm guessing they really want us to just use DataFrames like we did in 1.x.
Related
I have a notebook in Databricks where I only have SQL queries. I want to know whether it's better (in terms of performance) to switch all of them to PySpark, or if it would be the same.
In other words I want to know if databricks-sql uses spark-sql to execute the queries.
I found this question (looks pretty similar to mine), but the answer is not what I want to know.
Yes, you can definitely use PySpark in place of SQL.
The decision mostly depends on the type of data store. If your data is stored in a database, then SQL is the best option. If you are working with DataFrames, then PySpark is a good option, as it gives you more flexibility and features with supported libraries.
Yes, Databricks SQL uses Spark SQL and the DataFrame APIs under the hood to execute the queries.
DataFrames use the Tungsten memory representation, and the Catalyst optimizer is used by SQL as well as by the DataFrame API. With the Dataset API you have more control over the actual execution plan than with Spark SQL.
Refer to the PySpark documentation for more details and a better understanding.
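One way to convince yourself is to compare the plans Spark produces for a SQL query and for the equivalent DataFrame code. A minimal sketch in Scala (the people view is made up; the PySpark calls are analogous):

import spark.implicits._

// Made-up data registered as a temp view for the SQL version.
val people = Seq(("Alice", "NY"), ("Bob", "SF")).toDF("name", "city")
people.createOrReplaceTempView("people")

val viaSql = spark.sql("SELECT city, count(*) AS n FROM people GROUP BY city")
val viaDf  = people.groupBy("city").count().withColumnRenamed("count", "n")

// Both go through the same Catalyst optimizer and Tungsten engine, so the
// physical plans printed here should be essentially the same.
viaSql.explain()
viaDf.explain()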
I started reading the book "Spark: The Definitive Guide: Big Data Processing Made Simple" to learn Spark. While I was reading, I saw a line saying "A DataFrame is the most common Structured API and simply represents a table of data with rows and columns." I am not able to understand why RDDs and DataFrames are called APIs.
They're called APIs because they're essentially just different interfaces to exactly the same data. A DataFrame can be built on top of an RDD, and an RDD can be extracted from a DataFrame. They simply have different sets of functions defined on that data; the main differences are the semantics and the way you work with the data, with RDD being the lower-level API and DataFrame the higher-level one. For example, you can use the Spark SQL interface with DataFrames, which provides all the common SQL functions, but if you decide to use RDDs you would need to write those SQL functions yourself using RDD transformations.
And of course, they both exist because it really comes down to your use case.
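A small sketch of what that looks like in practice (Scala, with made-up data): the same records are reachable through either interface, but the RDD version makes you write the aggregation logic yourself.

import spark.implicits._

val df = Seq(("NY", 1), ("NY", 2), ("SF", 3)).toDF("city", "amount")   // made-up data

// A DataFrame can hand you back its underlying RDD of Rows...
val asRdd = df.rdd

// ...and an RDD can be turned into a DataFrame again.
val backToDf = asRdd.map(r => (r.getString(0), r.getInt(1))).toDF("city", "amount")

// Higher-level API: a grouped sum is one line of SQL-like code.
df.groupBy("city").sum("amount").show()

// Lower-level API: the same aggregation written by hand with RDD transformations.
asRdd.map(r => (r.getString(0), r.getInt(1).toLong))
  .reduceByKey(_ + _)
  .collect()
  .foreach(println)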
I have been working in Spark for the last 6+ months. I have seen that people coming from data warehousing and SQL backgrounds implement aggregations and other transformation logic in SQL using
spark.sql()
(where spark is the SparkSession object)
directly over Hive tables, or after registering a DataFrame as a temp view using
dataframe.createOrReplaceTempView().
But we also have other options, such as window functions and other alternatives that can be implemented directly over DataFrames, or we can even register a function as a UDF and apply it to a DataFrame.
Say I need to count population grouped by city over a DataFrame CITY_CENSUS. I can implement it with either of the methods below.
using spark.sql():
CITY_CENSUS.createOrReplaceTempView("CITY_CENSUS")
spark.sql("select city,count(population) from CITY_CENSUS group by city")
Or using aggregation directly over the DataFrame:
CITY_CENSUS.groupBy("city").agg(count("population"))
There are many examples like this.
Is there any performance benefit in using the DataFrame approach over spark.sql(), or vice versa?
The DataFrame DSL will not handle all subqueries, currently. Using Spark SQL you will be able to tackle such situations better. Aggregations may need these as well.
UDFs are not optimizable by Catalyst and could result in less performant physical plans.
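As a small illustration (a sketch reusing the hypothetical CITY_CENSUS DataFrame from the question): the built-in function stays a Catalyst expression, while the equivalent UDF shows up as an opaque ScalaUDF node in the plan.

import org.apache.spark.sql.functions.{col, udf, upper}

// Built-in function: Catalyst knows what upper() does and can optimize around it.
CITY_CENSUS.select(upper(col("city"))).explain()

// Equivalent UDF: shows up in the plan as an opaque ScalaUDF that Catalyst cannot
// look inside, and every row is converted to/from JVM objects just to call it.
val upperUdf = udf((s: String) => if (s == null) null else s.toUpperCase)
CITY_CENSUS.select(upperUdf(col("city"))).explain()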
I know the advantages of Datasets (type safety etc.), but I can't find any documentation about the limitations of Spark Datasets.
Are there any specific scenarios where a Spark Dataset is not recommended and it is better to use a DataFrame?
Currently all our data engineering flows use Spark (Scala) DataFrames.
We would like to make use of Datasets for all our new flows, so knowing all the limitations/disadvantages of Datasets would help us.
EDIT: This is not similar to Spark 2.0 Dataset vs DataFrame, which explains some operations on DataFrames/Datasets, or to other questions, most of which explain the differences between RDD, DataFrame and Dataset and how they evolved. This question is targeted at knowing when NOT to use Datasets.
There are a few scenarios where I find that a Dataframe (or Dataset[Row]) is more useful than a typed dataset.
For example, when I'm consuming data without a fixed schema, like JSON files containing records of different types with different fields. Using a Dataframe I can easily "select" out the fields I need without needing to know the whole schema, or even use a runtime configuration to specify the fields I'll access.
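For instance, a rough sketch of that pattern (the path and field list here are hypothetical):

import org.apache.spark.sql.functions.col

// Schema is inferred from the JSON itself; no case class needed up front.
val events = spark.read.json("/data/mixed-events/*.json")

// Which fields we care about can come from runtime configuration rather than
// a compile-time type.
val wantedFields = Seq("eventType", "timestamp", "payload.userId")
events.select(wantedFields.map(col): _*).show()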
Another consideration is that Spark can better optimize the built-in Spark SQL operations and aggregations than UDAFs and custom lambdas. So if you want to get the square root of a value in a column, that's a built-in function (df.withColumn("rootX", sqrt("X"))) in Spark SQL but doing it in a lambda (ds.map(X => Math.sqrt(X))) would be less efficient since Spark can't optimize your lambda function as effectively.
There are also many untyped DataFrame functions (like the statistical functions) that are implemented for DataFrames but not for typed Datasets, and you'll often find that even if you start out with a Dataset, by the time you've finished your aggregations you're left with a DataFrame, because those functions work by creating new columns and modifying the schema of your dataset.
In general I don't think you should migrate from working Dataframe code to typed Datasets unless you have a good reason to. Many of the Dataset features are still flagged as "experimental" as of Spark 2.4.0, and as mentioned above not all Dataframe features have Dataset equivalents.
Limitations of Spark Datasets:
Datasets used to be less performant (not sure if that's been fixed yet)
You need to define a new case class whenever you change the Dataset schema, which is cumbersome
Datasets don't offer as much type safety as you might expect. We can pass the reverse function a date object and it'll return a garbage response rather than erroring out.
import java.sql.Date
import org.apache.spark.sql.functions.reverse   // needed for reverse()
import spark.implicits._                        // needed for toDS() and $"..."

case class Birth(hospitalName: String, birthDate: Date)

val birthsDS = Seq(
  Birth("westchester", Date.valueOf("2014-01-15"))
).toDS()

// reverse() happily string-reverses the Date column instead of failing to compile.
birthsDS.withColumn("meaningless", reverse($"birthDate")).show()
+------------+----------+-----------+
|hospitalName| birthDate|meaningless|
+------------+----------+-----------+
| westchester|2014-01-15| 51-10-4102|
+------------+----------+-----------+
The Spark SQL DataFrame/Dataset execution engine has several extremely efficient time & space optimizations (e.g. InternalRow & expression codegen). According to much of the documentation, it seems to be a better option than RDDs for most distributed algorithms.
However, I did some source code research and am still not convinced. I have no doubt that InternalRow is much more compact and can save a large amount of memory. But execution of an algorithm may not be any faster, except for predefined expressions. Namely, the source code of org.apache.spark.sql.catalyst.expressions.ScalaUDF indicates that every user-defined function does 3 things:
convert the catalyst type (used in InternalRow) to the Scala type (used in GenericRow)
apply the function
convert the result back from the Scala type to the catalyst type
Apparently this is even slower than just applying the function directly on an RDD without any conversion. Can anyone confirm or deny my speculation with some real-case profiling and code analysis? (A rough skeleton of the kind of comparison I mean is sketched below.)
Thank you so much for any suggestion or insight.
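For concreteness, this is the kind of rough comparison I have in mind (a naive timing skeleton, not real profiling; the column name and size are made up):

import org.apache.spark.sql.functions.{col, sum, udf}

def time[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(s"$label: ${(System.nanoTime() - start) / 1e6} ms")
  result
}

val nums = spark.range(10000000L).toDF("x")   // made-up size, single local run

// 1) Built-in expression: stays inside codegen / InternalRow, no conversion.
time("builtin") { nums.agg(sum(col("x") * 2)).collect() }

// 2) ScalaUDF: catalyst type -> Scala type -> function -> back to catalyst, per row.
val doubleIt = udf((x: Long) => x * 2)
time("udf") { nums.agg(sum(doubleIt(col("x")))).collect() }

// 3) Plain RDD: applies the function directly on JVM objects, no planner involved.
time("rdd") { nums.rdd.map(_.getLong(0) * 2).sum() }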
From the Databricks blog article A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets:
When to use RDDs?
Consider these scenarios or common use cases for using RDDs when:
you want low-level transformation and actions and control on your dataset;
your data is unstructured, such as media streams or streams of text;
you want to manipulate your data with functional programming constructs rather than domain specific expressions;
you don't care about imposing a schema, such as columnar format, while processing or accessing data attributes by name or column;
and you can forgo some optimization and performance benefits available with DataFrames and Datasets for structured and semi-structured data.
In High Performance Spark's Chapter 3, "DataFrames, Datasets, and Spark SQL", you can see some of the performance gains you can get with the DataFrame/Dataset API compared to RDDs.
And in the Databricks article mentioned above you can also find that DataFrames optimize space usage compared to RDDs.
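As a tiny illustration of the "unstructured data + functional constructs" case from the list above, this is the sort of job where the RDD API feels natural (Scala sketch with a hypothetical input path):

// Classic word count over raw, schemaless text: just functional transformations.
val lines = spark.sparkContext.textFile("/data/raw-logs/*.txt")   // hypothetical path
val counts = lines
  .flatMap(_.split("\\s+"))
  .filter(_.nonEmpty)
  .map(word => (word.toLowerCase, 1))
  .reduceByKey(_ + _)

counts.take(10).foreach(println)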
I think a Dataset is a schema-aware RDD.
When you create a Dataset, you should give it a StructType (schema).
In fact, after the logical plan and the physical plan are produced, a Dataset will generate RDD operators underneath. Maybe this is why a plain RDD can sometimes perform better than a Dataset.
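You can see a bit of that from the outside: every DataFrame/Dataset can show you the plans Catalyst produced and the RDD it is ultimately executed as (a small sketch with made-up data):

import spark.implicits._

val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")

// The parsed/optimized logical plans and the physical plan Catalyst produced...
df.explain(true)

// ...and (roughly) the RDD lineage that physical plan is executed as.
println(df.rdd.toDebugString)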