I just got introduced to Spark SQL higher-order functions such as transform(), filter(), etc. I searched online, but couldn't find many advanced use cases leveraging these functions.
Can anyone please explain transform() with a couple of advanced, real-life use cases using a SQL query? Does it always need to work on nested complex types (arrays, structs, etc.), or can it be used to process simple data-type records as well?
Any help is appreciated.
Thanks
The following online resource demonstrates these functions vividly in %sql mode:
https://docs.databricks.com/delta/data-transformation/higher-order-lambda-functions.html
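For a quick feel, here is a minimal, hedged sketch in PySpark (table and column names are invented): transform() maps a lambda over every element of an array column and filter() keeps the matching elements, so they are aimed at array types; a simple scalar column can still be fed through the same lambda if you first wrap it with array().

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical data: one row per order with an array of item prices.
    spark.createDataFrame(
        [(1, [10.0, 20.0, 30.0]), (2, [5.0, 15.0])],
        ["order_id", "prices"],
    ).createOrReplaceTempView("orders")

    spark.sql("""
        SELECT order_id,
               transform(prices, p -> p * 1.2)           AS prices_with_tax,  -- apply lambda to each element
               filter(prices, p -> p > 10.0)             AS pricey_items,     -- keep elements matching predicate
               transform(array(order_id), x -> x + 100)  AS shifted_id        -- scalar column wrapped in array()
        FROM orders
    """).show(truncate=False)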
Related
We store very complex JSON in one of our table columns, and I would like to write a parser for it. I was reading through table functions and functions, but I never saw a good guide on how to create a function and deploy it to our cluster. Does anyone have any good pointers?
I am working on something where I already have SQL code in place. Now we are migrating to Azure, so I created an Azure Databricks workspace for the transformation piece and reused the same SQL code with some minor changes.
I want to know - is there any recommended way or best practice to work with Azure Databricks?
Should we re-write the code in PySpark for better performance?
Note: the end results from the previous SQL code have no bugs; it's just that we are migrating to Azure. Instead of spending time re-writing the code, I reused the same SQL code. Now I am looking for suggestions to understand the best practices and how they will make a difference.
Looking for your help.
Thanks !
Expecting -
Along with the migration from on-prem to Azure, I am looking for some best practices for better performance.
Under the hood, all of the code (SQL/Python/Scala, if written correctly) is executed by the same execution engine. You can always compare the execution plans of SQL & Python (EXPLAIN <query> for SQL, and dataframe.explain() for Python) and see that they are the same for the same operations.
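As a hedged illustration (the data and column names below are made up), you can see this for yourself by expressing the same aggregation both ways and comparing the plans:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical sample data, registered both as a DataFrame and as a SQL view.
    df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["key", "value"])
    df.createOrReplaceTempView("t")

    # Same aggregation expressed in SQL and in the DataFrame API.
    spark.sql("EXPLAIN SELECT key, sum(value) FROM t GROUP BY key").show(truncate=False)
    df.groupBy("key").agg(F.sum("value")).explain()

    # Both emit a physical plan with the same HashAggregate/Exchange operators,
    # because both paths go through the same Catalyst optimizer and engine.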
So if your SQL code is already working, you may continue to use it:
You can trigger SQL queries/dashboards/alerts from Databricks Workflows
You can use SQL operations in Delta Live Tables (DLT)
You can use DBT together with Databricks Workflows
But often you can get more flexibility or functionality when using Python. For example (this is not a full list):
You can programmatically generate DLT tables that perform the same transformations but on different tables
You can use streaming sources (SQL support for streaming isn't very broad yet)
You need to integrate your code with some 3rd party libraries
But really, on Databricks you can usually mix & match SQL & Python code together; for example, you can expose Python code as a user-defined function and call it from SQL (small example of a DLT pipeline that does that), etc.
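As a minimal sketch of that last point (the function name and data below are invented, and this uses plain PySpark rather than the linked DLT pipeline):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.getOrCreate()

    # A plain Python function holding whatever custom logic you need.
    def mask_email(email):
        if email is None:
            return None
        user, _, domain = email.partition("@")
        return user[:1] + "***@" + domain

    # Register it so it becomes callable from SQL as mask_email(...).
    spark.udf.register("mask_email", mask_email, StringType())

    spark.createDataFrame(
        [("alice@example.com",), ("bob@example.com",)], ["email"]
    ).createOrReplaceTempView("users")

    spark.sql("SELECT mask_email(email) AS masked FROM users").show(truncate=False)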
You asked a lot of questions there but I'll address the one you asked in the title:
Any benefits of using Pyspark code over SQL?
Yes.
PySpark is easier to test. For example, a transformation written in PySpark can be abstracted into a Python function which can then be executed in isolation within a test, so you can employ one of the myriad of Python testing frameworks (personally I'm a fan of pytest); a minimal sketch appears after this list. This isn't as easy with SQL, where a transformation exists within the confines of the entire SQL statement and can't be abstracted without the use of views or user-defined functions, which are physical database objects that need to be created.
PySpark is more composable. One can pull together custom logic from different places (perhaps written by different people) to define an end-to-end ETL process.
PySpark's lazy evaluation is a beautiful thing. It allows you to compose an ETL process in an exploratory fashion, making changes as you go. It really is what makes PySpark (and Spark in general) a great thing, and the benefits of lazy evaluation can't really be explained; they have to be experienced.
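To make the testing point concrete, a minimal sketch (with an invented transformation and column names) of a PySpark function exercised with pytest might look like this:

    # The transformation lives in its own function, so a unit test can
    # exercise it in isolation; run with `pytest`.
    from pyspark.sql import DataFrame, SparkSession
    from pyspark.sql import functions as F


    def add_total_price(df: DataFrame) -> DataFrame:
        """Hypothetical transformation: total_price = quantity * unit_price."""
        return df.withColumn("total_price", F.col("quantity") * F.col("unit_price"))


    def test_add_total_price():
        spark = SparkSession.builder.master("local[1]").getOrCreate()
        df = spark.createDataFrame([(2, 5.0), (3, 1.5)], ["quantity", "unit_price"])

        result = add_total_price(df).collect()

        assert [row.total_price for row in result] == [10.0, 4.5]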
Don't get me wrong, I love SQL, and for ad-hoc exploration it can't be beaten. There are good, justifiable reasons for using SQL over PySpark, but that wasn't your question.
These are just my opinions, others may beg to differ.
After getting help on the posted question and doing some research, I came up with the response below --
It does not matter which language you choose (SQL or Python). Since the code runs on a Spark cluster, Spark distributes it across the cluster. Which to use where depends on the specific use case.
Intermediate results of both SQL and PySpark DataFrames are stored in memory.
In the same notebook we can use both languages, depending on the situation; a small sketch of mixing the two follows this summary.
Use Python - for heavy transformations (more complex data processing) or for analytical / machine learning purposes
Use SQL - when dealing with relational data sources (focused on querying and manipulating structured data stored in a relational database)
Note: there may be some optimization techniques in both languages which we can use to make the performance better.
Summary: choose the language based on the use case. Both offer distributed processing because they run on a Spark cluster.
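For instance, a small hedged sketch of mixing the two (names are made up; in a Databricks notebook the spark.sql() call below could equally be a %sql cell, since both languages share the same session):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Python side: do the programmatic cleanup and expose the result to SQL
    # by registering a temporary view.
    raw = spark.createDataFrame(
        [("2024-01-01", "a", 3), ("2024-01-01", "b", 5), ("2024-01-02", "a", 7)],
        ["day", "key", "value"],
    )
    raw.withColumn("day", F.to_date("day")).createOrReplaceTempView("cleaned")

    # SQL side: plain relational querying over the prepared view.
    spark.sql("""
        SELECT day, key, sum(value) AS total
        FROM cleaned
        GROUP BY day, key
        ORDER BY day, key
    """).show()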
Thank you !
I came across this interesting date-conversion table for PostgreSQL and am wondering if this syntax for date patterns can also be used in Presto. It might or might not work; is anyone familiar with this?
Equivalent functionality is available through the date_format() Trino (formerly Presto) function.
The to_char function does not exist in Presto.
The documentation contains the list of built-in Presto functions. When in doubt, you can get a list of all the available functions using the SHOW FUNCTIONS statement.
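As a hedged sketch (the connection details, catalog, and schema below are placeholders) using the trino Python client, the key difference is that date_format() takes MySQL-style pattern specifiers rather than PostgreSQL's to_char() patterns:

    import trino  # pip install trino

    # Placeholder connection details; adjust for your cluster.
    conn = trino.dbapi.connect(
        host="trino.example.com", port=8080, user="analyst",
        catalog="hive", schema="default",
    )
    cur = conn.cursor()

    # PostgreSQL: to_char(order_ts, 'YYYY-MM-DD HH24:MI')
    # Trino/Presto equivalent with MySQL-style specifiers:
    cur.execute("SELECT date_format(current_timestamp, '%Y-%m-%d %H:%i')")
    print(cur.fetchall())

    # When in doubt, list the available built-in functions:
    cur.execute("SHOW FUNCTIONS")
    print(cur.fetchall()[:10])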
I started exploring ADX a few days ago. I imported my data from Azure SQL into ADX using an ADF pipeline, but when I query that data it takes a long time. Looking for a workaround, I researched table data partitioning, and I am now fairly clear on partition types and tricks.
The problem is, I couldn't find any sample (Kusto syntax) that guides me in defining partitioning on ADX database tables. Can anyone please help me with this syntax?
The partition operator is probably what you are looking for:
T | partition by Col1 ( top 10 by MaxValue )
T | partition by Col1 { U | where Col2=toscalar(Col1) }
ADX doesn't currently have the notion of partitioning a table, though it may be added in the future.
That said, with the lack of technical information currently provided, it's somewhat challenging to understand how you reached the conclusion that partitioning your table is required and is the appropriate solution, as opposed to the (many) other directions that ADX does allow you to pursue.
If you are willing to detail what actions you're performing, the characteristics of your data & schema, and which parts are performing slower than expected, that may help in providing you a more meaningful and helpful answer.
[If you aren't keen on exposing that information publicly, it's OK to open a support ticket with these details (through the Azure portal).]
(Update: the functionality has been available for a while now; read more at https://yonileibowitz.github.io/blog-posts/data-partitioning.html)
Spark version 1.5+ has windowing functions. I believe there was comprehensive documentation for SQL somewhere, but I have been unable to find it.
Here are the docs for Spark DataFrames and SQL; they do NOT have the content sought:
http://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations
I have googled a number of different ways and am unable to find a comprehensive guide to the available SQL functions. The closest I could find is "Spark 1.5 new DataFrame operations" here:
https://databricks.com/blog/2015/09/16/spark-1-5-dataframe-api-highlights-datetimestring-handling-time-intervals-and-udafs.html
Update: I am looking specifically for a SQL reference - not an API (/scaladoc) reference, i.e. a reference showing the provided SQL functions, what their arguments are, their semantics, and maybe example usage.
There is a page about Windowing and Analytics in the Wiki which covers the window specification and aggregate functions, and it contains some examples.
How about this? This is for Spark 2.4.0:
https://spark.apache.org/docs/2.4.0/api/sql/index.html#last_value
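As a hedged sketch with made-up data, last_value from that reference can be used as a window function directly in SQL (here wrapped in PySpark):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    spark.createDataFrame(
        [("a", 1, 10), ("a", 2, 20), ("b", 1, 5), ("b", 2, 8)],
        ["grp", "seq", "amount"],
    ).createOrReplaceTempView("events")

    # Window specification: partition by grp, order by seq, and widen the frame
    # so last_value really returns the last row of each partition.
    spark.sql("""
        SELECT grp, seq, amount,
               last_value(amount) OVER (
                   PARTITION BY grp
                   ORDER BY seq
                   ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
               ) AS final_amount
        FROM events
    """).show()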
Databricks had a good introduction to window functions at https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html, but the definitive documentation should always be the API docs; scroll right to the bottom of the
Dataframe functions API documentation