Trino UDF: looking for a good sample - presto

We store very complex JSON in one of our table columns, and I would like to write a parser for it. I have been reading through the documentation on table functions and functions, but I have not found a good guide that explains how to create a function and deploy it to our cluster. Does anyone have any good pointers?

Related

Any benefits of using PySpark code over SQL in Azure Databricks?

I am working on something where I already have SQL code in place. We are now migrating to Azure, so I set up Azure Databricks for the transformation piece and reused the same SQL code with some minor changes.
I want to know: is there a recommended way or best practice for working with Azure Databricks?
Should we rewrite the code in PySpark for better performance?
Note: The results from the existing SQL code have no bugs; we are simply migrating to Azure. Instead of spending time rewriting the code, I reused the same SQL. Now I am looking for suggestions on best practices and on what difference they would make.
Looking for your help.
Thanks!
Expecting -
Along with the migration from on-prem to Azure, I am looking for some best practices for better performance.
Under the hood, all of the code (SQL/Python/Scala, if written correctly) is executed by the same execution engine. You can always compare the execution plans of SQL and Python (EXPLAIN <query> for SQL, and dataframe.explain() for Python) and see that they are the same for the same operations.
So if your SQL code is working already you may continue to use it:
You can trigger SQL queries/dashboards/alerts from Databricks Workflows
You can use SQL operations in Delta Live Tables (DLT)
You can use dbt together with Databricks Workflows
But often you can get more flexibility or functionality when using Python. For example (this is not a full list):
You can programmatically generate DLT tables that are performing the same transformations but on different tables
You can use streaming sources (SQL support for streaming isn't very broad yet)
You can integrate your code with third-party libraries
But really, on Databricks you can usually mix and match SQL and Python code. For example, you can expose Python code as a user-defined function and call it from SQL (small example of DLT pipeline that is doing that), etc.
You asked a lot of questions there but I'll address the one you asked in the title:
Any benefits of using PySpark code over SQL?
Yes.
PySpark is easier to test. For example, a transformation written in PySpark can be abstracted into a Python function, which can then be executed in isolation within a test, so you can use one of the many Python testing frameworks (personally I'm a fan of pytest). This isn't as easy with SQL, where a transformation exists within the confines of the entire SQL statement and can't be abstracted without views or user-defined functions, which are physical database objects that need to be created.
PySpark is more composable. One can pull together custom logic from different places (perhaps written by different people) to define an end-to-end ETL process.
PySpark's lazy evaluation is a beautiful thing. It allows you to compose an ETL process in an exploratory fashion, making changes as you go. It really is what makes PySpark (and Spark in general) a great thing. The benefits of lazy evaluation can't really be explained; they have to be experienced.
Don't get me wrong, I love SQL, and for ad-hoc exploration it can't be beaten. There are good, justifiable reasons for using SQL over PySpark, but that wasn't your question.
These are just my opinions, others may beg to differ.
After getting help on the posted question and doing some research, I came up with the response below:
It does not matter which language you choose (SQL or Python). Both run on a Spark cluster, so Spark distributes the work across the cluster. Which one to use depends on the specific use case.
Intermediate results of both SQL and PySpark DataFrame operations are held in memory.
In the same notebook we can use both languages, depending on the situation.
Use Python - for heavy transformations (more complex data processing) or for analytical / machine learning purposes
Use SQL - when dealing with a relational data source (focused on querying and manipulating structured data stored in a relational database)
Note: Both languages offer optimization techniques that can be used to improve performance.
Summary: Choose the language based on the use case. Both get distributed processing because they run on a Spark cluster.
Thank you !

How to parse big XML in google cloud function efficiently?

I have to extract data from XML files of several hundred MB in a Google Cloud Function, and I was wondering if there are any best practices.
Since I am used to Node.js, I was looking at some popular libraries like fast-xml-parser, but they seem cumbersome if you only want specific data from a huge XML file. I am also not sure whether there are performance issues when the XML is too big. Overall this does not feel like the best way to parse and extract data from huge XML files.
Then I was wondering if I could use BigQuery for this task: I would simply convert the XML to JSON and load it into a dataset, where I could then use a query to retrieve the data I want.
Another solution could be to use Python for the job, since it is good at parsing and extracting data from XML. Even though I have no experience in Python, I was wondering if this path could still be the best solution.
If anything above does not make sense, if one solution is preferable to the other, or if anyone can share any insights, I would highly appreciate it!
I suggest checking this article, which discusses how to load XML data into BigQuery using Python and Dataflow. I think this approach may work in your situation.
Basically, what they suggest is:
Parse the XML into a Python dictionary using the xmltodict package.
Specify a schema for the output table in BigQuery.
Use a Beam pipeline to take an XML file and use it to populate a BigQuery table.
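If the goal is only to pull a few fields out of a very large file, a streaming parser may be another option besides the xmltodict approach from the article. Below is a minimal sketch using iterparse from Python's standard library, which is a swapped-in technique and not what the article uses. The element names (order, id, total) and the sample document are made up for illustration, and the sketch assumes the records are direct children of the root element.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(source, tag, fields):
    """Yield one dict per <tag> element, keeping only the requested child fields."""
    # Listen for "start" as well as "end" so the root element can be captured
    # from the first event and emptied as records are consumed.
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield {name: elem.findtext(name) for name in fields}
            # Drop records that were already processed so memory use does not
            # grow with the size of the file.
            root.clear()


# Hypothetical sample standing in for a file of several hundred MB.
sample = io.BytesIO(
    b"<orders>"
    b"<order><id>1</id><total>9.50</total><note>ignored</note></order>"
    b"<order><id>2</id><total>12.00</total><note>ignored</note></order>"
    b"</orders>"
)
records = list(iter_records(sample, "order", ["id", "total"]))
print(records)
```

The same function should accept a file path or any binary file-like object in place of the BytesIO sample, so the whole document never has to be held in memory at once.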

Does anyone know a dataset to test Delta Lake / Apache Iceberg?

I'm looking for an example dataset (or several) to test Delta Lake and Apache Iceberg, but I couldn't find any.
I want to test the MERGE operation of both and compare them, but a small example is not enough to measure performance and decide which one is better.
I would like a dataset with primary keys that starts with the first version of the table and comes with multiple datasets (small or large) containing the changes, so that I can test MERGE.
If anyone can help me, I would appreciate it. Thanks in advance.
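In case no public dataset fits, the structure described above (a first version keyed by primary key, plus change sets) can be generated synthetically. This is only a rough sketch with made-up column names (id, amount, op) and arbitrary sizes. It builds plain Python rows that would still need to be written out as CSV or Parquet and loaded into each table format.

```python
import random


def make_base(n_rows, seed=0):
    """Version 0 of the table: one row per primary key."""
    rng = random.Random(seed)
    return [{"id": i, "amount": rng.randint(1, 1000), "op": "insert"}
            for i in range(n_rows)]


def make_change_batch(existing_ids, next_id, n_updates, n_inserts, n_deletes, seed):
    """One change set: updates and deletes hit existing keys, inserts use new keys."""
    rng = random.Random(seed)
    # Sample distinct keys so no key is both updated and deleted in one batch.
    touched = rng.sample(sorted(existing_ids), n_updates + n_deletes)
    batch = [{"id": k, "amount": rng.randint(1, 1000), "op": "update"}
             for k in touched[:n_updates]]
    batch += [{"id": k, "amount": None, "op": "delete"}
              for k in touched[n_updates:]]
    batch += [{"id": next_id + j, "amount": rng.randint(1, 1000), "op": "insert"}
              for j in range(n_inserts)]
    return batch


base = make_base(1000)
batch = make_change_batch({row["id"] for row in base}, next_id=1000,
                          n_updates=100, n_inserts=50, n_deletes=20, seed=1)
print(len(base), len(batch))
```

Scaling n_rows and the per-batch counts up, and varying the ratio of updates to inserts, gives a series of change sets large enough for a MERGE comparison.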

Create Data Catalog column tags by inspecting BigQuery data with Cloud Data Loss Prevention

I want to use DLP to inspect my tables in BigQuery, and then write the findings to policy tags on the columns of the table. For example, I have a (test) table that contains data including an email address and a phone number for individuals. I can use DLP to find those fields and identify them as emails and phone numbers, and I can do this in the console or via the API (I'm using NodeJS). When creating this inspection job, I know I can configure it to automatically write the findings to the Data Catalog, but this generates a tag on the table, not on the columns. I want to tag the columns with the specific type of PII that has been identified.
I found this tutorial that appears to achieve exactly that - but tutorial is a strong word; it's a script written in Java and a basic explanation of what that script does, with the only actual instructions being to clone the git repo and run a few commands. There's no information about which API calls are being made, not a lot of comments in the code, and no links to pertinent documentation. I have zero experience with Java, so I'm not able to work out the process and translate it into NodeJS for my own purposes.
I also found this similar tutorial which also utilises Dataflow, and again the instructions are simply "clone this repo, run this script". I've included the link because it features a screenshot showing what I want to achieve: tagging columns with PII data found by DLP
So, what I want to do appears to be possible, but I can't find useful documentation anywhere. I've been through the DLP and Data Catalog docs, and through the API references for NodeJS. If anyone could help me figure out how to do this, I'd be very grateful.
UPDATE: I've made some progress and changed my approach as a result.
DLP provides two methods to inspect data: dlp.inspectContent() and dlp.createDlpJob(). The latter takes a storageItem which can be a BigQuery table, but it doesn't return any information about the columns in the results, so I don't believe I can use it.
inspectContent() cannot be run on a BigQuery table directly, but it can inspect structured text, which is what the Java script I linked above does: it queries the BigQuery table, constructs a Table from the results, and passes that Table into inspectContent(), which returns a Findings object containing field names. I want to do exactly that, but in NodeJS. I'm struggling to convert the BigQuery results into the Table format, as NodeJS doesn't appear to have a constructor for that type like Java does.
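For what it's worth, the Table that inspectContent() expects is, as far as I understand the DLP API, just a nested structure of headers and rows, so it may not need a constructor at all. Here is a hedged sketch of the conversion in Python (not NodeJS) using plain dicts. The function name rows_to_dlp_table and the sample rows are hypothetical, and the field names (headers, name, rows, values, string_value) are my reading of the Table message and should be checked against the API reference. The NodeJS client presumably accepts the same shape as a plain object with camelCase keys such as stringValue.

```python
def rows_to_dlp_table(rows):
    """Convert query result rows (column name -> value) into a DLP-style table dict."""
    if not rows:
        return {"headers": [], "rows": []}
    # Column order is taken from the first row and reused for every row.
    columns = list(rows[0].keys())
    return {
        "headers": [{"name": col} for col in columns],
        "rows": [
            {
                "values": [
                    # Every cell is sent as a string; NULLs become empty strings.
                    {"string_value": "" if row.get(col) is None else str(row[col])}
                    for col in columns
                ]
            }
            for row in rows
        ],
    }


# Hypothetical rows, shaped like the output of a BigQuery query.
sample_rows = [
    {"email": "a@example.com", "phone": "555-0100"},
    {"email": "b@example.com", "phone": None},
]
table = rows_to_dlp_table(sample_rows)
print(table["headers"])
```

The resulting dict would then go into the request as item: {"table": table}.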
I was unable to find Node.js documentation on implementing column-level tags.
However, you might find the Policy Tags official documentation helpful to point you in the right direction. Specifically, you might lack some roles to manage column-level tags.

Higher Order functions in Spark SQL query

I was just introduced to the Spark SQL higher-order functions transform(), filter(), etc. I searched online but couldn't find many advanced use cases that leverage these functions.
Can anyone please explain transform() with a couple of advanced real-life use cases written as SQL queries? Does it always need to work on nested complex types (arrays, structs, etc.), or can it be used to process records with simple data types as well?
Any help is appreciated.
Thanks
The following online resource demonstrates these functions in %sql mode:
https://docs.databricks.com/delta/data-transformation/higher-order-lambda-functions.html
