Using SQL Stored Procedure vs Databricks in Azure Data Factory - azure

I have a requirement to write upto 500k records daily to Azure SQL DB using an ADF pipeline.
I had simple calculations as part of the data transformation that can performed in a SQL Stored procedure activity. I've also observed Databricks Notebooks being used commonly, esp. due to benefits of scalability going forward. But there is an overhead activity of placing files in another location after transformation, managing authentication etc. and I want to avoid any over-engineering unless absolutely required.
I've tested SQL Stored Proc and it's working quite well for ~50k records (not yet tested with higher volumes).
But I'd still like to know the general recommendation between the 2 options, esp. from experienced Azure or data engineers.
Thanks

I'm not sure there is enough information to make a solid recommendation. What is the source of the data? Why is ADF part of the solution? Is this 500K rows once per day or a constant stream? Are you loading to a Staging table then using SPROC to move and transform the data to another table?
Here are a couple thoughts:
If the data operation is SQL to SQL [meaning the same SQL instance for both source and sink], then use Stored Procedures. This allows you to stay close to the metal and will perform the best. An exception would be if the computational load is really complicated, but that doesn't appear to be the case here.
Generally speaking, the only reason to call Data Bricks from ADF is if you already have that expertise and the resources already exist to support it.
Since ADF is part of the story, there is a middle ground between your two scenarios - Data Flows. Data Flows are a low-code abstraction over Data Bricks. They are ideal for in-flight data transforms and perform very well at high loads. You do not author or deploy notebooks, nor do you have to manage the Data Bricks configuration. And they are first class citizens in ADF pipelines.

As an experienced (former) DBA, Data Engineer and data architect, I cannot see what Databricks adds in this situation. This piece of the architecture you might need to scale is the target for the INSERTs, ie Azure SQL Database which is ridiculously easy to scale either manually via the portal or via the REST API, if even required. Consider techniques such as loading into heaps and partition switching if you need to tune the insert.
The overhead of adding an additional component to your architecture and then taking your data through would have to be worth it, plus the additional cost of spinning up Spark clusters at the same time your db is running.
Databricks is a superb tool and has a number of great use cases, eg advanced data transforms (ie things you cannot do with SQL), machine learning, streaming and others. Have a look at this free resource for a few ideas:
https://databricks.com/p/ebook/the-big-book-of-data-science-use-cases

Related

REST API to query Databricks table

I have a usecase and needed help with the best available approach.
I use Azure databricks to create data transformations and create table in the presentation layer/gold layer. The underlying data in these tables are in Azure Storage account.
The transformation logic runs twice daily and updates the gold layer tables.
I have several such tables in the gold layer Eg: a table to store Single customer view data.
An external application from a different system needs access to this data i.e. the application would initiate an API call for details regarding a customer and need to send back the response for matching details (customer details) by querying the single customer view table.
Question:
Is databricks SQL API the solution for this?
As it is a spark table, the response will not be quick i assume. Is this correct or is there a better solution for this.
Is databricks designed for such use cases or is a better approach to copy this table (gold layer) in an operational database such as azure sql db after the transformations are done in pyspark via databricks?
What are the cons of this approach? One would be the databricks cluster should be up and running all time i.e. use interactive cluster. Anything else?
It's possible to use Databricks for that, although it heavily dependent on the SLAs - how fast should be response. Answering your questions in order:
There is no standalone API for execution of queries and getting back results (yet). But you can create a thin wrapper using one of the drivers to work with Databricks: Python, Node.js, Go, or JDBC/ODBC.
Response time heavily dependent on the size of the data, and if the data is already cached on the nodes, and other factors (partitioning of the data, data skipping, etc.). Databricks SQL Warehouses are also able to cache results of queries execution so they won't reprocess the data if such query was already executed.
Storing data in operational databases is also one of the approaches that often used by different customers. But it heavily dependent on the size of the data, and other factors - if you have huge gold layer, then SQL databases may also not the best solution from cost/performance perspective.
For such queries it's recommended to use Databricks SQL that is more cost efficient that having always running interactive cluster. Also, on some of the cloud platforms there is already support for serverless Databricks SQL, where the startup time is very short (seconds instead of minutes), so if your queries to gold layer doesn't happen very often, you may have them configured with auto-termination, and pay only when they are used.

Is Star Schema (data modelling) still relevant with the Lake House pattern using Databricks?

The more I read about the Lake House architectural pattern and following the demos from Databricks I hardly see any discussion around Dimensional Modelling like in a traditional data warehouse (Kimball approach). I understand the compute and storage are much cheaper but are there any bigger impacts in terms of queries performance without the data modelling? In spark 3.0 onwards I see all the cool features like Adaptive Query Engine, Dynamic Partition Pruning etc., but is the dimensional modelling becoming obsolete because of that? If anyone implemented dimensional modelling with Databricks share your thoughts?
Not really a question for here, but interesting.
Of course Databricks et al are selling their Cloud solutions - I'm fine with that.
Taking this video https://go.incorta.com/recording-death-of-the-star-schema into account - whether paid for or the real opinion of Imhoff:
The computing power is higher at lower cost - if you manage it and you can more things on the fly.
That said, the same could be stated with SAP Hana, where you do ETL on the fly. I am not sure why every time I would want to have a virtual creation of a type 2 dimension.
Star schemas require thought and maintenance, but show focus. Performance is less of an issue.
It is true that ad hoc queries do not work well with star schemas over multiple fact tables. Try it.
Databricks has issues with sharing Clusters with SCALA, if you do it their way with pyspark it is OK.
It remains to be seen if querying via Tableau works well on Delta Lake - I need to see it for myself. In the past we had thrift server etc. for this and it did not work, but things are different now.
Where I am now we have Data Lake on HDP with delta format - and a
dimensional SQL Server DWH. The latter due to the on-premises aspects
of HDP.
Not having star schemas means people need more skills to query.
If I took ad hoc querying then I would elect the Lakehouse, but
actually I think you need both. It's a akin to the discussion do you
need ETL tools if you have Spark.
The Kimball's star schema and Data Vault modeling techniques are still relevant for Lakehouse patterns, and mentioned optimizations, like, Adaptive Query Execution, Dynamic Partition Pruning, etc., combined with Data Skipping, ZOrder, bloom filters, etc. are making queries very efficient.
Really, Databricks data warehousing specialists recently published two related blog posts:
Data Warehousing Modeling Techniques and Their Implementation on the Databricks Lakehouse Platform: Using Data Vaults and Star Schemas on the Lakehouse
Prescriptive Guidance for Implementing a Data Vault Model on the Databricks Lakehouse Platform
In our use case we access the lakehouse using PowerBI + Spark SQL and being able to significantly reduce the data volume the queries return by using the star schema makes the experience faster for the end-user and saves compute resources.
However considering things like the columnar nature of parquet files and partition pruning which both also decrease the data volume per query, I can imagine scenarios in which a reasonable setup without star schema could work.

Mapping Dataflow vs SQL Stored Procedure in ADF pipeline

I have a requirement where I need to choose between Mapping Data Flow vs SQL Stored Procedures in an ADF pipeline to implement some business scenarios. The data volume is not too huge now but might get larger at a later stage.
The business logic are at times complex where I will have to join multiple tables, write sub queries, use windows functions, nested case statements, etc.
All of my business requirements could be easily implemented through a SP but there is a slight inclination towards mapping data flow considering that it runs spark underneath and can scale up as required.
Does ADF Mapping data flow has an upper hand over SQL Stored Procedures when used in an ADF pipeline?
Some of the concerns that I have with the mapping data flow are as below.
Time taken to implement complex logic using data flows is much more
than a stored procedure
The execution time for a mapping data flow is
much higher considering the time it takes to spin up the spark
cluster.
Now, if I decide to use SQL SPs in the pipeline, what could be the disadvantages?
Would there be issues with the scalability if the data volume grows rapidly at some point in time?
This is kind of an opinion question which doesn't tend to do well on stackoverflow, but the fact you're comparing Mapping Data Flows with stored procs tells me that you have Azure SQL Database (or similar) and Azure Data Factory (ADF) in your architecture.
If you think about the fact Mapping Data Flows is backed by Spark clusters, and you already have Azure SQL DB, then what you really have is two types of compute. So why have both? There's nothing better than SQL at doing joins, nested queries etc. Azure SQL DB can easily be scaled up and down (eg via its REST API) - that seemed to be one of your points.
Having said that, Mapping Data Flows is powerful and offers a nice low-code experience. So if your requirement is to have low-code with powerful transforms then it could be a good choice. Just bear in mind that if your data is already in a database and you're using Mapping Data Flows, that what you're doing is taking data out of SQL, up into a Spark cluster, processing it, then pushing it back down. This seems like duplication to me, and I reserve Mapping Data Flows (and Databricks notebooks) for things I cannot already do in SQL, eg advanced analytics, hard maths, complex string manipulation might be good candidates. Another use case might be work offloading, where you deliberately want to offload work from your db. Just remember the cost implication of having two types of compute running at the same time.
I also saw an example recently where someone had implemented a slowly changing dimension type 2 (SCD2) using Mapping Data Flows but had used 20+ different MDF components to do it. This is low-code in name only to me, with high complexity, hard to maintain and debug. The same process can be done with a single MERGE statement in SQL.
So my personal view is, use Mapping Data Flows for things that you can't already do with SQL, particularly when you already have SQL databases in your architecture. I personally prefer an ELT pattern, using ADF for orchestration (not MDF) which I regard as easier to maintain.
Some other questions you might ask are:
what skills do your team have? SQL is a fairly common skill. MDF is still low-code but niche.
what skills do your support team have? Are you going to train them on MDF when you hand this over?
how would you rate the complexity and maintainability of the two approaches, given the above?
HTH
One disadvantage with using SP's in your pipeline, is that your SP will run directly against the database server. So if you have any other queries/transactions or jobs running against the DB at the same time that your SP is executing you may experience longer run times for each (depending on query complexity, records read, etc.). This issue could compound as data volume grows.
We have decided to use SP's in our organization instead of Mapping Data Flows. The cluster spin up time was an issue for us as we scaled up. To address the issue I mentioned previously with SP's, we stagger our workload, and schedule jobs to run during off-peak hours.

Key differences between Azure DocumentDB and Azure Table Storage

I am choosing database technology for my new project. I am wondering what are the key differences between Azure DocumentDB and Azure Table Storage?
It seems that main advantage of DocumentDB is full text search and rich query functionality. If I understand it correctly, I would not need separate search engine library such as Lucene/Elasticsearch.
On the other hand Table Storage is much cheaper.
What are the other differences that could influence my decision?
I consider Azure Search an alternative to Lucene. I used Lucene.net in a worker role and simply the idea of not having to deal with the infrastructure, ingestion, etc.. issues make the Azure Search service very appealing to me.
There is a scenario I approached with Azure storage in which I see DocumentDB
as a perferct fit, and it might explain my point of view.
I used Azure storage to prepare and keep daily summaries of the user activities in my solution outside of Azure SQL Database, as the summaries are requested frequently by a large number of clients with good chances to experience spikes on certain times of the day. A simple write once read many scenario usage pattern (my schema) Azure SQL db found it difficult to cope with while it perfectly fit the capacity of storage (btw daily summaries were not in cache because of size) .
This scenario evolved over time and now I happen to keep more aggregated and ready to use data in those summaries, and updates became more complex.
Keeping these daily summaries in DocumentDB would make the write once part of the scenario more granular, updating only the relevant data in the complex summary, and ease the read part, as the capability of getting parts of more summaries becomes a trivial quest, for example.
I would consider DocumentDB in scenarios in which data is unstructured and rather complex and I need rich query capability (Table storage is lagging on this part).
I would consider Azure Search in scenarios in which a high throughput full-text search is required.
I did not find the quotas/expected perf to precisely compare DocumentDB to Search but I highly suspect Search is the best fit to replace Lucene.
HTH, Davide

ColumnStore index benefits on Azure?

We are currently running on Azure and we have a table with hundreds of millions of rows. This table is static and will be refreshed weekly. We've looked at ColumnStore index but unfortunately it is not Azure yet so below are my questions,
Will ColumnStore index be available in Azure?
if not what other technology can we use to get the same performance
benefits as the ColumnStore index would provide?
Can we get the same query performance by using Azure Table Storage?
I'm a newbie to both Azure and Columnar databases so please bear me with me if I ask these questions.. :)
About ColumnStore, if you have bought the license, you can check with development team or ask on blogs such as ScottGu's Blog. From there only you will come to know about any feature release.
Azure Database is designed for scalability. You will need to use the Partition Key very wisely. Partition Key is like index of book, so if you want to search something in book, you can quickly refer to the index and reach the page quickly. In other words, you can group data depending upon certain criteria and store it in a single partition. So where ever you have the same criteria, your query will hit only one partition. The thing with partitions is, for a table you can any number of partition, but it is not necessary that all the partition will reside on same machine or even same farm. So when you fire a query on badly designed Azure Table, it can hit more than one server, and thus bad performance. Read about Real World: Designing a Scalable Partitioning Strategy for Windows Azure Table Storage
Hope you get what you are looking for.
As Amar pointed out, keep an eye on the team blogs for the latest in new feature announcements. The goal for SQL Azure is for it to eventually be where new features are found first. However, it will still take awhile for things to get there.
As for your peformance question, there's no simple answer for this. Windows Azure resources are designed for scale, not necessarially high performance. So its to take your scale/capacity targets into account when designing solutions. For your situation, I would encourage you to conside table storage, but this will depend on frequency access and the types of queries you need to make on the data. Just do not be surprised if you have to mave redundant copies of your data that are modelled differently, or possibly even running parrallel queries and aggregating results. This is the way table storage was designed to be used. Its cheaper then SQL Azure and its this price difference that makes redundant specialized data models possible.
This approach also has to be weighed against the cost of retraining your developers to stop thinking in RDBMS terms. :)

Resources