In Databricks, does SQL use Spark?

I have a Databricks notebook that contains only SQL queries. I want to know whether, in terms of performance, it is better to switch all of them to PySpark or whether it would be the same.
In other words, I want to know whether Databricks SQL uses Spark SQL to execute the queries.
I found this question (looks pretty similar to mine), but the answer is not what I want to know.

Yes, you can definitely use PySpark in place of SQL.
The decision mostly depends on the type of data store. If your data is stored in a database, then SQL is the best option. If you are working with DataFrames, then PySpark is a good option, as it gives you more flexibility and features through its supported libraries.
PySpark uses the Spark SQL and DataFrame APIs.
DataFrames use the Tungsten memory representation, and the Catalyst optimizer is used by SQL as well as by DataFrames. With the Dataset API, you have more control over the actual execution plan than with Spark SQL.
Refer to the PySpark documentation for more details and a better understanding.

Related

Any benefits of using Pyspark code over SQL in Azure databricks?

I am working on a project where I already have SQL code in place. We are now migrating to Azure, so I created an Azure Databricks workspace for the transformation piece and reused the same SQL code with some minor changes.
I want to know: is there a recommended way or best practice for working with Azure Databricks?
Should we rewrite the code in PySpark for better performance?
Note: the end results from the previous SQL code have no bugs. It's just that we are migrating to Azure. Instead of spending time rewriting the code, I reused the same SQL code. Now I am looking for suggestions to understand the best practices and what difference they would make.
Looking for your help.
Thanks !
Expecting:
Along with the migration from on-prem to Azure, I am looking for some best practices for better performance.
Under the hood, all of the code (SQL/Python/Scala, if written correctly) is executed by the same execution engine. You can always compare the execution plans of SQL and Python (EXPLAIN <query> for SQL, and dataframe.explain() for Python) and see that they are the same for the same operations.
So if your SQL code is working already you may continue to use it:
You can trigger SQL queries/dashboards/alerts from Databricks Workflows
You can use SQL operations in Delta Live Tables (DLT)
You can use dbt together with Databricks Workflows
But often you can get more flexibility or functionality when using Python. For example (this is not a full list):
You can programmatically generate DLT tables that are performing the same transformations but on different tables
You can use streaming sources (SQL support for streaming isn't very broad yet)
You need to integrate your code with some 3rd party libraries
But really, on Databricks you can usually mix and match SQL and Python code. For example, you can expose Python code as a user-defined function and call it from SQL (small example of DLT pipeline that is doing that), etc.
You asked a lot of questions there but I'll address the one you asked in the title:
Any benefits of using Pyspark code over SQL?
Yes.
PySpark is easier to test. For example, a transformation written in PySpark can be abstracted to a Python function, which can then be executed in isolation within a test, so you can use any of the myriad Python testing frameworks (personally I'm a fan of pytest). This isn't as easy with SQL, where a transformation exists within the confines of the entire SQL statement and can't be abstracted without views or user-defined functions, which are physical database objects that need to be created.
PySpark is more composable. One can pull together custom logic from different places (perhaps written by different people) to define an end-to-end ETL process.
PySpark's lazy evaluation is a beautiful thing. It allows you to compose an ETL process in an exploratory fashion, making changes as you go. It really is what makes PySpark (and Spark in general) a great thing and the benefits of lazy evaluation can't really be explained, it has to be experienced.
Don't get me wrong, I love SQL and for ad-hoc exploration it can't be beaten. There are good, justifiable reasons, for using SQL over PySpark, but that wasn't your question.
These are just my opinions, others may beg to differ.
After getting help on the posted question and doing some research, I came up with the response below:
It does not matter which language you choose (SQL or Python). Because both run on a Spark cluster, Spark distributes the work across the cluster either way. Which one to use depends on the specific use case.
Intermediate results of both SQL and PySpark DataFrames are stored in memory.
In the same notebook we can use both languages, depending on the situation.
Use Python - for heavy transformations (more complex data processing) or for analytical / machine learning purposes
Use SQL - when dealing with a relational data source (focused on querying and manipulating structured data stored in a relational database)
Note: there may be optimization techniques in both languages that we can use to improve performance.
Summary: choose the language based on the use case. Both get distributed processing because they run on a Spark cluster.
Thank you !

Sample Spark Evaluator code for Streamsets

I am trying to write a Spark evaluator in StreamSets. I have to deal with complex SQL queries and would therefore like to use DataFrames or Datasets here. But the sample code that StreamSets provides deals with JavaRDD only. Can anyone give me some pointers on using DataFrames here to get a head start?
You are almost certainly better off looking at using StreamSets Transformer. Transformer has a much deeper Spark integration and will allow you to work with native Spark structures.

Disadvantages of Spark Dataset over DataFrame

I know the advantages of Datasets (type safety etc.), but I can't find any documentation on the limitations of Spark Datasets.
Are there any specific scenarios where a Spark Dataset is not recommended and it is better to use a DataFrame?
Currently all our data engineering flows use Spark (Scala) DataFrames.
We would like to use Datasets for all our new flows, so knowing all the limitations/disadvantages of Datasets would help us.
EDIT: This is not similar to Spark 2.0 Dataset vs DataFrame, which explains some operations on DataFrames/Datasets, or to other questions, most of which explain the differences between RDD, DataFrame and Dataset and how they evolved. This question is about when NOT to use Datasets.
There are a few scenarios where I find that a Dataframe (or Dataset[Row]) is more useful than a typed dataset.
For example, when I'm consuming data without a fixed schema, like JSON files containing records of different types with different fields. Using a Dataframe I can easily "select" out the fields I need without needing to know the whole schema, or even use a runtime configuration to specify the fields I'll access.
Another consideration is that Spark can better optimize the built-in Spark SQL operations and aggregations than UDAFs and custom lambdas. So if you want to get the square root of a value in a column, that's a built-in function (df.withColumn("rootX", sqrt("X"))) in Spark SQL but doing it in a lambda (ds.map(X => Math.sqrt(X))) would be less efficient since Spark can't optimize your lambda function as effectively.
There are also many untyped Dataframe functions (like statistical functions) that are implemented for Dataframes but not typed Datasets, and you'll often find that even if you start out with a Dataset, by the time you've finished your aggregations you're left with a Dataframe because the functions work by creating new columns, modifying the schema of your dataset.
In general I don't think you should migrate from working Dataframe code to typed Datasets unless you have a good reason to. Many of the Dataset features are still flagged as "experimental" as of Spark 2.4.0, and as mentioned above not all Dataframe features have Dataset equivalents.
Limitations of Spark Datasets:
Datasets used to be less performant (not sure if that's been fixed yet)
You need to define a new case class whenever you change the Dataset schema, which is cumbersome
Datasets don't offer as much type safety as you might expect. We can pass the reverse function a date object and it'll return a garbage response rather than erroring out.
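import java.sql.Date
import org.apache.spark.sql.functions.reverse
import spark.implicits._  // assumes a SparkSession named `spark`, as in spark-shell
case class Birth(hospitalName: String, birthDate: Date)
val birthsDS = Seq(
  Birth("westchester", Date.valueOf("2014-01-15"))
).toDS()
// reverse() is meant for strings, yet the Date column is accepted without an error
birthsDS.withColumn("meaningless", reverse($"birthDate")).show()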
+------------+----------+-----------+
|hospitalName| birthDate|meaningless|
+------------+----------+-----------+
| westchester|2014-01-15| 51-10-4102|
+------------+----------+-----------+

Can Sqoop be used to perform joins on the IMPORT?

I was asked this question recently while describing a use case that involved multiple joins, plus some processing that I had implemented in Spark. The question was: could the joins not have been done while importing the data to HDFS using Sqoop? I want to understand, from an architectural standpoint, whether it is advisable to implement the joins in Sqoop even if it is possible.
It is possible to do joins in sqoop imports.
From an architecture point of view, it depends on your use case. Sqoop is mainly a utility for fast imports/exports; all the ETL can be done through Spark/Pig/Hive/Impala.
Although it is doable, I would recommend against it: it will increase your job's run time and put load on your source system for computing the joins/aggregations. Sqoop was also primarily designed as an ingestion tool for structured sources.
It depends on the infrastructure of your data pipeline. If you are already using Spark for some other purpose, it is better to use the same Spark for importing the data as well. Sqoop supports joins and will be sufficient if you only need to import data and nothing else. Hope this answers your query.
You can use:
a view in the DBMS that Sqoop reads from, optionally using sqoop eval first to set parameters in the database
a free-form SQL query for Sqoop in which the JOIN is defined
However, views with JOINs cannot be used for incremental imports.
The facility of using free-form query in the current version of Sqoop
is limited to simple queries where there are no ambiguous projections
and no OR conditions in the WHERE clause. Use of complex queries such
as queries that have sub-queries or joins leading to ambiguous
projections can lead to unexpected results.
The Sqoop import tool supports joins. This can be achieved using the --query option (don't use this option together with --table / --columns).
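As a rough sketch of what a join import via --query looks like, the snippet below assembles the command line in Python. The JDBC URL, table names, column names and target directory are all hypothetical. The sketch relies on three documented requirements for free-form query imports: the query must contain the $CONDITIONS token, --target-dir must be given, and a parallel import needs --split-by (otherwise a single mapper with -m 1).

```python
import shlex

def sqoop_join_import(jdbc_url, query, target_dir, split_by=None):
    """Build the argument list for a Sqoop free-form query import."""
    # Sqoop substitutes its own split predicate for this placeholder.
    if "$CONDITIONS" not in query:
        raise ValueError("free-form query must contain the $CONDITIONS placeholder")
    args = ["sqoop", "import",
            "--connect", jdbc_url,
            "--query", query,
            "--target-dir", target_dir]
    if split_by:
        # Parallel import: Sqoop needs a column to partition the work on.
        args += ["--split-by", split_by]
    else:
        # No split column: fall back to a single mapper.
        args += ["-m", "1"]
    return args

# Hypothetical join between two source tables.
query = ("SELECT o.id, o.amount, c.name "
         "FROM orders o JOIN customers c ON o.customer_id = c.id "
         "WHERE $CONDITIONS")
cmd = sqoop_join_import("jdbc:mysql://db.example.com/shop", query,
                        "/data/orders_joined", split_by="o.id")

# shlex.join single-quotes the query, so the shell leaves $CONDITIONS alone.
print(shlex.join(cmd))
```

The join itself still runs on the source database, so the load concern raised in the earlier answers applies to this form as well.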

When to use Spark DataFrame/Dataset API and when to use plain RDD?

Spark SQL's DataFrame/Dataset execution engine has several extremely efficient time and space optimizations (e.g. InternalRow and expression codegen). According to much of the documentation, it seems to be a better option than RDDs for most distributed algorithms.
However, I did some source code research and am still not convinced. I have no doubt that InternalRow is much more compact and can save a large amount of memory. But execution of algorithms may not be any faster, except for predefined expressions. Namely, the source code of org.apache.spark.sql.catalyst.expressions.ScalaUDF indicates that every user-defined function does 3 things:
convert catalyst type (used in InternalRow) to scala type (used in GenericRow).
apply the function
convert the result back from scala type to catalyst type
Apparently this is even slower than just applying the function directly on RDD without any conversion. Can anyone confirm or deny my speculation by some real-case profiling and code analysis?
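The three steps can be pictured with a toy model in plain Python. This is not Spark's code: it only assumes that strings are held internally as UTF-8 bytes (as Catalyst's UTF8String does) and that the user function works on ordinary strings. The helper names are made up for the illustration.

```python
from typing import Callable

def to_external(internal: bytes) -> str:
    """Step 1: internal representation -> the type the user function expects."""
    return internal.decode("utf-8")

def to_internal(external: str) -> bytes:
    """Step 3: the user function's result -> internal representation."""
    return external.encode("utf-8")

def wrap_udf(func: Callable[[str], str]) -> Callable[[bytes], bytes]:
    """Wrap a plain function so it can be applied to internal values."""
    def wrapped(value: bytes) -> bytes:
        # convert in, apply (step 2), convert back out
        return to_internal(func(to_external(value)))
    return wrapped

# A user-defined function applied to values stored in internal form.
shout = wrap_udf(str.upper)
internal_rows = [b"spark", b"catalyst"]
converted = [shout(v) for v in internal_rows]
print(converted)
```

In this model the two conversions are paid for every value, whereas a function applied directly to already-external objects pays neither. Whether that cost outweighs the other savings is exactly what the question asks to have profiled.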
Thank you so much for any suggestion or insight.
From this Databricks' blog article A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets
When to use RDDs?
Consider these scenarios or common use cases for using RDDs when:
you want low-level transformation and actions and control on your dataset;
your data is unstructured, such as media streams or streams of text;
you want to manipulate your data with functional programming constructs than domain specific expressions;
you don’t care about imposing a schema, such as columnar format, while processing or accessing data attributes by name or column;
and you can forgo some optimization and performance benefits available with DataFrames and Datasets for structured and semi-structured data.
In High Performance Spark, Chapter 3 (DataFrames, Datasets, and Spark SQL), you can see some of the performance gains you can get with the DataFrame/Dataset API compared to RDDs.
And in the Databricks' article mentioned you can also find that Dataframe optimizes space usage compared to RDD
I think a Dataset is an RDD with a schema.
When you create a Dataset, you should give it a StructType.
In fact, after the logical plan and the physical plan are built, a Dataset is executed as RDD operators. Maybe this is why RDD performance can be better than Dataset performance.
