I am working on something where I have a SQL code in place already. Now we are migrating to Azure. So I created an Azure databricks for the piece of transformation and used the same SQL code with some minor changes.
I want to know - Is there any recommended way or best practice to work with Azure databricks ?
Should we re-write the code in PySpark for the better performance?
Note : End results from the previous SQL code has no bugs. Its just that we are migrating to Azure. Instead of spending time over re-writing the code, I made use of same SQL code. Now I am looking for suggestions to understand the best practices and how it will make a difference.
Looking for your help.
Thanks !
Expecting -
Along with the migration from on prem to Azure. I am looking for some best practices for better performance.
Under the hood, all of the code (SQL/Python/Scala, if written correctly) is executed by the same execution engine. You can always compare execution plans of SQL & Python (EXPLAIN <query for SQL, and dataframe.explain() for Python) and see that they are the same for same operations.
So if your SQL code is working already you may continue to use it:
You can trigger SQL queries/dashboards/alerts from Databricks Workflows
You can use SQL operations in Delta Live Tables (DLT)
You can use DBT together with Dataricks Workflows
But often you can get more flexibility or functionality when using Python. For example (this is not a full list):
You can programmatically generate DLT tables that are performing the same transformations but on different tables
You can use streaming sources (SQL support for streaming isn't very broad yet)
You need to integrate your code with some 3rd party libraries
But really, on Databricks you can usually mix & match SQL & Python code together, for example, you can expose Python code as user-defined function and call it from SQL (small example of DLT pipeline that is doing that), etc.
You asked a lot of questions there but I'll address the one you asked in the title:
Any benefits of using Pyspark code over SQL?
Yes.
PySpark is easier to test. For example, a transformation written in PySpark can be abstracted to a python function which can then be executed in isolation within a test, thus you can employ the use of one of the myriad of of python testing frameworks (personally I'm a fan of pytest). This isn't as easy with SQL where a transformation exists within the confines of the entire SQL statement and can't be abstracted without use of views or user-defined-functions which are physical database objects that need to be created.
PySpark is more composable. One can pull together custom logic from different places (perhaps written by different people) to define an end-to-end ETL process.
PySpark's lazy evaluation is a beautiful thing. It allows you to compose an ETL process in an exploratory fashion, making changes as you go. It really is what makes PySpark (and Spark in general) a great thing and the benefits of lazy evaluation can't really be explained, it has to be experienced.
Don't get me wrong, I love SQL and for ad-hoc exploration it can't be beaten. There are good, justifiable reasons, for using SQL over PySpark, but that wasn't your question.
These are just my opinions, others may beg to differ.
After getting help on the posted question and doing some research I came up with below response --
It does not matter which language do you choose (SQL or python). Since it uses Spark cluster, so Sparks distributes it across cluster. It depends on specific use cases where to use what.
Both SQL and PySpark dataframe intermediate results gets stored in memory.
In a same notebook we can use both the languages depending upon the situation.
Use Python - For heavy transformation (more complex data processing) or for analytical / machine learning purpose
Use SQL - When we are dealing with relational data source (focused on querying and manipulating structured data stored in a relational database)
Note: There may be some optimization techniques in both the languages which we can use to make the performance better.
Summary : Choose language based on the use cases. Both has the distributed processing because its running on Spark cluster.
Thank you !
Related
I have a notebook in databricks where I only have SQL queries, I want to know if it's better (talking about performance) to switch all of them to pyspark or if it would be the same.
In other words I want to know if databricks-sql uses spark-sql to execute the queries.
I found this question (looks pretty similar to mine), but the answer is not what I want to know.
Yes, you can definitely use PySpark in place of SQL.
The decision mostly depends on the type of data store. If your data is stored in database then SQL is the best option. If you are working with DataFrames, then PySpark is the good options as it gives you more flexibility and features with supported libraries.
It uses SparkSQL and DataFrame APIs.
Dataframe uses tungsten memory representation , catalyst optimizer used by SQL as well as DataFrame. With Dataset API, you have more control on the actual execution plan than with SparkSQL.
Refer PySpark for more details and better understanding.
I'm searching for tool for storing documentation about tables, datasources, etl processes and etc for my DWH.
I've seen some presentations on youtube, but I've found out, that most of the companies are using custom, own system or something like wiki ith plain text descriptions.
I think, that it is not so useful for Analysts, Mangers and other user to find out , what they need and how to use data to calculate suitable for them statistics.
Can you suggest, please, what may I use for this case? What I must read?
While Airflow was baked with some support for Apache-Atlas, in my opinion
the one of the best data-lake metadata management tools right now is Lyft's Amundsen
and they've also released lyft/amundsendatabuilder, the introduction of which says
Amundsen Databuilder is a data ingestion library, which is inspired by
Apache Gobblin. It could be used in an orchestration
framework(e.g. Apache Airflow) to build data from Amundsen. You could
use the library either with an adhoc python script(example) or
inside an Apache Airflow DAG(example).
I am on jooq queries now...I feel the SQL queries looks more readable and maintainable and why we need to use JOOQ instead of using native SQL queries.
Can someone explains few reason for using the same?
Thanks.
Here are the top value propositions that you will never get with native (string based) SQL:
Dynamic SQL is what jOOQ is really really good at. You can compose the most complex queries dynamically based on user input, configuration, etc. and still be sure that the query will run correctly.
An often underestimated effect of dynamic SQL is the fact that you will be able to think of SQL as an algebra, because instead of writing difficult to compose native SQL syntax (with all the keywords, and weird parenthesis rules, etc.), you can think in terms of expression trees, because you're effectively building an expression tree for your queries. Not only will this allow you to implement more sophisticated features, such as SQL transformation for multi tenancy or row level security, but every day things like transforming a set of values into a SQL set operation
Vendor agnosticity. As soon as you have to support more than one SQL dialect, writing SQL manually is close to impossible because of the many subtle differences in dialects. The jOOQ documentation illustrates this e.g. with the LIMIT clause. Once this is a problem you have, you have to use either JPA (much restricted query language: JPQL) or jOOQ (almost no limitations with respect to SQL usage).
Type safety. Now, you will get type safety when you write views and stored procedures as well, but very often, you want to run ad-hoc queries from Java, and there is no guarantee about table names, column names, column data types, or syntax correctness when you do SQL in a string based fashion, e.g. using JDBC or JdbcTemplate, etc. By the way: jOOQ encourages you to use as many views and stored procedures as you want. They fit perfectly in the jOOQ paradigm.
Code generation. Which leads to more type safety. Your database schema becomes part of your client code. Your client code no longer compiles when your queries are incorrect. Imagine someone renaming a column and forgetting to refactor the 20 queries that use it. IDEs only provide some degree of safety when writing the query for the first time, they don't help you when you refactor your schema. With jOOQ, your build fails and you can fix the problem long before you go into production.
Documentation. The generated code also acts as documentation for your schema. Comments on your tables, columns turn into Javadoc, which you can introspect in your client language, without the need for looking them up in the server.
Data type bindings are very easy with jOOQ. Imagine using a library of 100s of stored procedures. Not only will you be able to access them type safely (through code generation), as if they were actual Java code, but you don't have to worry about the tedious and useless activity of binding each single in and out parameter to a type and value.
There are a ton of more advanced features derived from the above, such as:
The availability of a parser and by consequence the possibility of translating SQL.
Schema management tools, such as diffing two schema versions
Basic ActiveRecord support, including some nice things like optimistic locking.
Synthetic SQL features like type safe implicit JOIN
Query By Example.
A nice integration in Java streams or reactive streams.
Some more advanced SQL transformations (this is work in progress).
Export and import functionality
Simple JDBC mocking functionality, including a file based database mock.
Diagnostics
And, if you occasionally think something is much simpler to do with plain native SQL, then just:
Use plain native SQL, also in jOOQ
Disclaimer: As I work for the vendor, I'm obviously biased.
I was asked this question recently where I was describing a use case which involved multiple joins in addition to some processing that I had implemented in Spark, the question was, could the joins have not been done while importing the data to HDFS using Sqoop? I wanted to understand from an architectural standpoint if it's advisable to implement the joins in Sqoop even if it's possible.
It is possible to do joins in sqoop imports.
From an architecture point of view, It depends on your usecase, sqoop is mainly a utility for fast imports/exports. All the etl can be done through spark/pig/hive/impala.
Although it is doable, I would recommend not to, since it will increase your job's time efficiency plus it will put load on your source for computing joins/aggregations as well also sqoop was primarily designed to be an ingestion tool for structured sources.
It depends on the infrastructure of your data pipeline, if you are using Spark for some other purpose then it will be better to use the same Spark for importing the data as well. Sqoop support join and will be sufficient if you only need to import data and nothing else. Hope this answers your query.
You can use:
a view in the DBMS where reading from using sqoop eval to set parameters in DB there, optionally.
freeform SQL for sqoop wher JOIN defined
However, views with JOINs cannot be used for incremental imports.
The facility of using free-form query in the current version of Sqoop
is limited to simple queries where there are no ambiguous projections
and no OR conditions in the WHERE clause. Use of complex queries such
as queries that have sub-queries or joins leading to ambiguous
projections can lead to unexpected results.
Sqoop import tool supports join. It can be archived using --query option (Don't use this option with --table / --column).
Spark uses in memory computing and caching to decrease latency on complex analytics, however this is mainly for "iterative algorythms",
If I needed to perform a more basic analytic, say perhaps each element was a group of numbers and I wanted to look for elements with a standard deviation less than 'x' would Spark still decrease latency compared to regular cluster computing (without in memory computing)? Assuming I used that same commodity hardware in each case.
It tied for the top sorting framework using none of those extra mechanisms, so I would argue that is reason enough. But, you can also run streaming, graphing, or machine learning without having to switch gears. Then, you add in that you should use DataFrames wherever possible and you get query optimizations beyond any other framework that I know of. So, yes, Spark is the clear choice in almost every instance.
One good thing about spark is its Datasource API combining it with SparkSQL gives you ability to query and join different data sources together. SparkSQL now includes decent optimizer - catalyst. As mentioned in one of the answer along with core (RDD) in spark you can also include streaming data, apply machine learning models and graph algorithms. So yes.