Azure elastic query performance

We have two applications that use separate databases on SQL Server 2012; however, several stored procedures get data from the other application's database using INNER JOINs (7 joins in total). We are looking at whether it is possible to move to Azure, and have set up a test using our existing databases, with external tables to get the data from the other database.
The problem is that the performance of these queries goes from 1-15 seconds on our server to 4+ minutes on Azure. We have tried moving the tables to the same database and it did fix the speed problem, although it isn't ideal to move all the tables over to the same DB.
For the purpose of our test, we are using an Azure Standard elastic pool with 50 DTUs.

Cross-database queries show good performance when the remote tables are not big. When remote tables are big, this article shows you how to perform joins remotely using table variables and improve performance.
This other article also shows you how to push parameterized operations to remote databases to improve performance.
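As an illustration, a minimal sketch of that approach using an external data source and sp_execute_remote; the server, database, credential and table names here are placeholders:

-- One-time setup in the database that issues the cross-database query.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';

CREATE DATABASE SCOPED CREDENTIAL RemoteDbCred
WITH IDENTITY = 'remote_user', SECRET = '<remote password>';

CREATE EXTERNAL DATA SOURCE OtherAppDb
WITH (
    TYPE = RDBMS,
    LOCATION = 'yourserver.database.windows.net',
    DATABASE_NAME = 'OtherAppDb',
    CREDENTIAL = RemoteDbCred
);

-- Push the parameterized work to the remote database instead of pulling the whole
-- remote table across the network and joining it locally.
EXEC sp_execute_remote
    @data_source_name = N'OtherAppDb',
    @stmt = N'SELECT o.OrderId, o.Total
              FROM dbo.Orders AS o
              WHERE o.CustomerId = @CustomerId',
    @params = N'@CustomerId int',
    @CustomerId = 42;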
Hope this helps.

Related

REST API to query Databricks table

I have a use case and need help with the best available approach.
I use Azure Databricks to run data transformations and create tables in the presentation layer/gold layer. The underlying data for these tables is in an Azure Storage account.
The transformation logic runs twice daily and updates the gold layer tables.
I have several such tables in the gold layer, e.g. a table to store single customer view data.
An external application from a different system needs access to this data, i.e. the application initiates an API call for details about a customer, and we need to send back the matching customer details by querying the single customer view table.
Question:
Is the Databricks SQL API the solution for this?
As it is a Spark table, I assume the response will not be quick. Is this correct, or is there a better solution?
Is Databricks designed for such use cases, or is it a better approach to copy this (gold layer) table into an operational database such as Azure SQL DB after the transformations are done in PySpark via Databricks?
What are the cons of this approach? One would be that the Databricks cluster has to be up and running all the time, i.e. using an interactive cluster. Anything else?
It's possible to use Databricks for that, although it heavily depends on the SLAs, i.e. how fast the response should be. Answering your questions in order:
There is no standalone API for executing queries and getting results back (yet). But you can create a thin wrapper using one of the drivers for working with Databricks: Python, Node.js, Go, or JDBC/ODBC.
Response time heavily depends on the size of the data, whether the data is already cached on the nodes, and other factors (partitioning of the data, data skipping, etc.). Databricks SQL warehouses can also cache query results, so they won't reprocess the data if the same query was already executed.
Storing data in an operational database is also an approach that is often used by customers. But it heavily depends on the size of the data and other factors: if you have a huge gold layer, then a SQL database may also not be the best solution from a cost/performance perspective.
For such queries it's recommended to use Databricks SQL, which is more cost efficient than an always-running interactive cluster. Also, some cloud platforms already support serverless Databricks SQL, where the startup time is very short (seconds instead of minutes), so if queries against the gold layer don't happen very often, you can configure the warehouse with auto-termination and pay only when it is used.

Using SQL Stored Procedure vs Databricks in Azure Data Factory

I have a requirement to write up to 500k records daily to Azure SQL DB using an ADF pipeline.
I have simple calculations as part of the data transformation that can be performed in a SQL Stored Procedure activity. I've also seen Databricks notebooks being used commonly, especially due to the benefits of scalability going forward. But there is the overhead of placing files in another location after transformation, managing authentication, etc., and I want to avoid any over-engineering unless absolutely required.
I've tested SQL Stored Proc and it's working quite well for ~50k records (not yet tested with higher volumes).
But I'd still like to know the general recommendation between the 2 options, esp. from experienced Azure or data engineers.
Thanks
I'm not sure there is enough information to make a solid recommendation. What is the source of the data? Why is ADF part of the solution? Is this 500K rows once per day or a constant stream? Are you loading to a Staging table then using SPROC to move and transform the data to another table?
Here are a couple thoughts:
If the data operation is SQL to SQL [meaning the same SQL instance for both source and sink], then use Stored Procedures. This allows you to stay close to the metal and will perform the best (a minimal sketch follows after these thoughts). An exception would be if the computational load is really complicated, but that doesn't appear to be the case here.
Generally speaking, the only reason to call Databricks from ADF is if you already have that expertise and the resources already exist to support it.
Since ADF is part of the story, there is a middle ground between your two scenarios - Data Flows. Data Flows are a low-code abstraction over Databricks. They are ideal for in-flight data transforms and perform very well at high loads. You do not author or deploy notebooks, nor do you have to manage the Databricks configuration. And they are first-class citizens in ADF pipelines.
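For the stored-procedure route, here is a rough sketch of the kind of set-based transformation an ADF Stored Procedure activity can call after the copy step; the schema, table and column names are made up for illustration:

CREATE OR ALTER PROCEDURE dbo.usp_LoadDailySales
AS
BEGIN
    SET NOCOUNT ON;

    -- Simple calculation applied while moving rows from staging into the target table.
    INSERT INTO dbo.DailySales (SaleDate, ProductId, Quantity, NetAmount)
    SELECT s.SaleDate,
           s.ProductId,
           s.Quantity,
           s.GrossAmount - s.DiscountAmount
    FROM staging.DailySales AS s;

    TRUNCATE TABLE staging.DailySales;
END;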
As an experienced (former) DBA, data engineer and data architect, I cannot see what Databricks adds in this situation. The piece of the architecture you might need to scale is the target of the INSERTs, i.e. Azure SQL Database, which is ridiculously easy to scale either manually via the portal or via the REST API, if that is even required. Consider techniques such as loading into heaps and partition switching if you need to tune the insert (see the sketch below).
The overhead of adding an additional component to your architecture and routing your data through it would have to be worth it, plus the additional cost of spinning up Spark clusters at the same time your database is running.
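A rough sketch of those two tuning ideas; the database, table and partition function names are hypothetical, and for the switch the staging table must match the target's schema and indexes and carry a CHECK constraint matching the target partition's range:

-- Scale the database up for the load window (also possible via the portal or the REST API),
-- then back down afterwards.
ALTER DATABASE MyAppDb MODIFY (SERVICE_OBJECTIVE = 'S3');

-- Bulk load into a staging table, then move the rows into the target partition
-- with a metadata-only switch.
ALTER TABLE dbo.Sales_Staging
    SWITCH TO dbo.Sales PARTITION $PARTITION.pfSalesDate('2024-06-01');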
Databricks is a superb tool and has a number of great use cases, eg advanced data transforms (ie things you cannot do with SQL), machine learning, streaming and others. Have a look at this free resource for a few ideas:
https://databricks.com/p/ebook/the-big-book-of-data-science-use-cases

Some input on how to proceed on the migration from SQL Server

I'm migrating from SQL Server to Azure SQL, and I'd like to ask those of you with more experience in Azure (I have basically none) some questions, just to understand what I need to do to have the best migration.
Today I do a lot of cross-database queries in some of my tasks that run once a week. I execute stored procedures and run selects, inserts and updates across the databases. I solved the execution of stored procedures by using external data sources and sp_execute_remote. But as far as I can see, it's only possible to select from an external database, meaning I won't be able to do any inserts or updates across the databases. Is that correct? If so, what's the best way to solve this problem?
I also read that cross-database calls are slow. Does this mean they are slower than in SQL Server? I want to know if I'll face a slower process compared to what I have today.
What I really need is some good guidelines on how to do the best migration without spending loads of time with trial and error. I appreciate any help in this matter.
Cross-database transactions are not supported in Azure SQL DB. You connect to a specific database and can't use three-part names or the USE syntax.
You could open two different connections from your program, one to each database. That doesn't give you any kind of transactional consistency, but it would allow you to retrieve data from one Azure SQL DB and insert it into another.
So, at least for now, if you want your database in Azure and you can't avoid cross-database transactions, you'll be using an Azure VM to host SQL Server.

Options to copy multiple tables with one million rows from on-prem SQL 2012 to SQL Azure

I have around 20 tables in my on-prem SQL Server 2012 instance that I would like to copy to a SQL Azure instance on a scheduled basis.
Each table has around one million records.
I need to schedule the table copy five times a day.
What are my options to copy this volume of data from SQL2012 to SQL Azure?
Considering the bandwidth limitations between our on-prem data center and Azure, is this a feasible requirement?
Thank you,
Assuming A) you have some .NET programming knowledge and B) the data should completely replace what is currently in the Azure tables, I would do the following.
Write some .NET code to read the data from the local database into an IDataReader; you don't mention FK dependencies, so you may have to take those into account.
Use SqlBulkCopy to insert the data into holding tables in Azure, named something like _myrealtablename. We use this with Entity Framework via EntityFramework.BulkInsert.
Then, when the data is uploaded, rename the tables within a transaction so that the new tables replace the previous ones.
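The swap step might look roughly like this; the table names are hypothetical, with _Customers being the freshly loaded holding table:

-- Replace the live table with the freshly loaded holding table in one transaction.
BEGIN TRANSACTION;
    EXEC sp_rename 'dbo.Customers', 'Customers_old';
    EXEC sp_rename 'dbo._Customers', 'Customers';
COMMIT TRANSACTION;

DROP TABLE dbo.Customers_old;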
We see performance of ~100,000 rows per second uploading in this manner on fairly wide tables. One caveat is how much of the data is actually changing; with timestamp/rowversion columns, uploading only the changed and new entities may be easier.
If you want something more rudimentary, you could use BCP to export and import the data.
Check out transactional replication: https://azure.microsoft.com/en-us/blog/transactional-replication-to-azure-sql-db/.
Mihaela

SQL Azure STDistance performance

I have a big performance problem with the STDistance function on SQL Azure.
I'm testing the same query
SELECT Coordinate
FROM MyTable
WHERE Coordinate.STDistance(@Center) < 50000
on a SQL Azure database (Standard) and on my local machine database.
Same database, same indexes (a spatial index on Coordinate), same data (400k rows), but I get two very different execution times.
The query takes less than 1 second on my local workstation and roughly 9 seconds on SQL Azure.
Anybody else has the same problem?
Federico
You can try the following things to reduce network latency:
Select the data center closest to the majority of your users
Co-Locate your DB with your application if your application is in Windows Azure as well
Minimize network round trips in your app
I would highly recommend you read this Azure SQL DB Perf guidance.
In addition to that, please check the existing service tier of your database and see if the performance is capping out. In that case, you might want to upgrade the service tier of your DB. If you would like to monitor the performance and adjust the performance levels, please use this link.
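If you would rather check the current tier from T-SQL before deciding whether to scale, a query along these lines works:

-- Current edition and service objective of the database you are connected to.
SELECT DATABASEPROPERTYEX(DB_NAME(), 'Edition')          AS Edition,
       DATABASEPROPERTYEX(DB_NAME(), 'ServiceObjective') AS ServiceObjective;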
Thanks
Silvia Doomra
Query performance depends on various factors, one of which is your performance tier. Verify whether you are hitting your resource limits (the sys.resource_stats DMV in the master database).
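For example, something like this against the logical server's master database (the database name is a placeholder):

-- Resource consumption for the last week, reported in 5-minute intervals.
SELECT start_time, end_time,
       avg_cpu_percent, avg_data_io_percent, avg_log_write_percent
FROM sys.resource_stats
WHERE database_name = 'MyDatabase'
  AND start_time > DATEADD(day, -7, GETUTCDATE())
ORDER BY start_time DESC;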
Besides that, there are a few other factors you can consider verifying: index fragmentation on Azure, network latency, locking, etc.
Application-level caching helps avoid hitting the database if the same query repeats.
You may also have to investigate which service tier and performance level is required based on the benchmarks here: AzureSQL-ServiceTier_PerformanceLevel
