Azure Data Lake - HDInsight vs Data Warehouse - azure

I'm in a position where we're reading from our Azure Data Lake using external tables in Azure Data Warehouse.
This enables us to read from the data lake, using well known SQL.
However, another option is using Data Lake Analytics, or some variation of HDInsight.
Performance-wise, I'm not seeing much difference. I assume Data Warehouse is running some form of distributed query in the background, converting to U-SQL(?), so why would we use Data Lake Analytics with the slightly different syntax of U-SQL?
With Python scripts also available in SQL, I feel I'm missing a key purpose of Data Lake Analytics, other than the cost (pay per batch job, rather than the constant uptime of a database).

If your main purpose is to query data stored in Azure Data Warehouse (ADW), then there is no real benefit to using Azure Data Lake Analytics (ADLA). But as soon as you have other (un)structured data stored in ADLS, such as JSON documents or CSV files, the benefit of ADLA becomes clear: U-SQL allows you to join your relational data stored in ADW with the (un)structured/NoSQL data stored in ADLS.
Also, it enables you to use U-SQL to prepare this other data for direct import into ADW, so Azure Data Factory is no longer required to get the data into your data warehouse. See this blog post for more information:
A common use case for ADLS and SQL DW is the following. Raw data is ingested into ADLS from a variety of sources. Then ADL Analytics is used to clean and process the data into a loading ready format. From there, the high value data can be imported into Azure SQL DW via PolyBase.
..
You can import data stored in ORC, RC, Parquet, or Delimited Text file formats directly into SQL DW using the Create Table As Select (CTAS) statement over an external table.
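
As a rough illustration of the CTAS step quoted above, here is a minimal sketch that builds such a statement in Python. The table names (`dbo.SensorReadings`, `ext.SensorReadings`) and the helper `build_ctas` are hypothetical, and the distribution and index options are just one plausible choice; the external table is assumed to exist already.

```python
def build_ctas(target_table: str, external_table: str,
               distribution: str = "ROUND_ROBIN") -> str:
    """Return a T-SQL CTAS statement that copies an external table into SQL DW.

    The identifiers are interpolated as-is, so they must come from trusted
    configuration, not from user input.
    """
    return (
        f"CREATE TABLE {target_table}\n"
        f"WITH (DISTRIBUTION = {distribution}, CLUSTERED COLUMNSTORE INDEX)\n"
        f"AS SELECT * FROM {external_table};"
    )


# Hypothetical names: ext.SensorReadings is an external table over files in ADLS.
ctas_sql = build_ctas("dbo.SensorReadings", "ext.SensorReadings")
print(ctas_sql)
```

The statement would then be run against the data warehouse with whatever SQL client you already use.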

Please note that the SQL statement in SQL Data Warehouse is currently NOT generating U-SQL behind the scenes. Also, the use cases for ADLA/U-SQL and SQL DW are different.
ADLA gives you a processing engine for batch data preparation/cooking, to generate the data for a data mart/warehouse that you can then read interactively with SQL DW. In your example above, you seem to be mainly doing the second part. Adding "Views" on top of these EXTERNAL tables to do transformations in SQL DW will quickly run into scalability limits if you are operating on big data (and not just a few 100k rows).

Related

How good is Azure Data Lake for storing an SQL database used for Power BI visualizations?

We have an Azure SQL database where we collect a large amount of sensor data. We regularly extract the data from it and transform it a bit with a Python script. The end result is a pandas DataFrame. We would like to store the transformed data in an Azure database and use it as the source of a Power BI dashboard.
On the one hand, we want to show the "almost" real-time data on a dashboard (the latency due to the transformation etc. is acceptable, but the dashboard needs to refresh very frequently, let's say once a minute), but we also want to store the transformed data and query it later e.g. to visualize the data only for a given day.
Is it possible to convert the pandas DataFrame into SQL and store it on Data Lake and stream the data from there? I read that it is possible to store structured data on Data Lake and even query it, but I am unsure if this would be the best solution.
(My current task is to choose the best database for storing the transformed data to enable both streaming and querying it later. I am very new in Azure products and I don't have a sandbox account yet to even try around and identify possible pitfalls. I've just figured out that PowerBI does not support DirectQuery for DataLake and I feel like this can be an issue - meaning we would have to query the data on DataLake at first and store it somewhere if we wanted to visualize a subset, is that correct?)
Azure Data Lake is not a database, just a store for both structured and unstructured data, so, as you mentioned, you can't query it directly unless you have some compute capacity (Databricks, Azure Synapse, Azure Data Lake Analytics, Power BI Premium with enhanced compute).
Depending on your approach, it may be best to move from Azure SQL Database and pandas to Azure Databricks, which can ingest the streaming data, transform it, and write an output table that is stored in the data lake. You would then connect Power BI to the Databricks instance and query that. The data will only be available while the cluster is running.
Moving to Databricks will involve rewriting your pandas code in Koalas or, preferably, PySpark.
You also have the option of using Databricks to write the results back to an Azure SQL Database table. Depending on what transformations you are doing, you could keep it all in Azure SQL or, if it is streaming sensor data, take the data through Azure Event Hubs to Azure Stream Analytics (which does the transformations) and on to Azure SQL Database (which stores the real-time and historical data).
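
To make the "store it in a database and query a given day later" part concrete, here is a small sketch. It uses an in-memory SQLite database purely as a local stand-in for the Azure SQL Database table, and the table name, columns, and sample rows are all made up for illustration; only the shape of the table and of the per-day query is the point.

```python
import sqlite3

# In-memory SQLite is only a stand-in for the Azure SQL Database table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sensor_readings ("
    " sensor_id TEXT NOT NULL,"
    " reading_ts TEXT NOT NULL,"  # ISO 8601 timestamp, sorts correctly as text
    " value REAL NOT NULL)"
)
# An index on the timestamp keeps per-day range queries cheap as history grows.
conn.execute("CREATE INDEX ix_readings_ts ON sensor_readings (reading_ts)")

# Made-up sample rows, standing in for the transformed DataFrame contents.
rows = [
    ("s1", "2021-03-01T23:59:30", 20.5),
    ("s1", "2021-03-02T00:00:30", 20.7),
    ("s2", "2021-03-02T12:15:00", 18.2),
]
conn.executemany("INSERT INTO sensor_readings VALUES (?, ?, ?)", rows)


def readings_for_day(day: str):
    """Return all readings whose timestamp falls on the given YYYY-MM-DD day."""
    return conn.execute(
        "SELECT sensor_id, reading_ts, value FROM sensor_readings"
        " WHERE reading_ts >= ? AND reading_ts < date(?, '+1 day')"
        " ORDER BY reading_ts",
        (day, day),
    ).fetchall()


day_rows = readings_for_day("2021-03-02")
print(day_rows)
```

The same table can serve both needs from the question: the dashboard reads the newest rows, and historical views filter on a timestamp range.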

Azure Data Lake for Structured Data

We've been reviewing the Modern Data Warehouse architectures from Microsoft (link here), which references using Azure Data Factory to pull structured and unstructured data into the Azure Data Lake. I've attended a lot of presentations on the subject as well, but most people are split on whether the Data Lake is a good home for structured data. What I am trying to determine is if importing data into the Data Lake is a good strategy if the only source we will be utilizing is on-prem SQL Server databases? And, what would be the advantage / disadvantages of that strategy?
For context's sake, we're looking for a single pane of glass for consumption, whether it's end users' reporting with Power BI or fodder for Azure Data Warehouse / on-prem Data Warehouse. We want one container that is the source for all of these systems, and which is not the source OLTP system (i.e. OLTP database --> (Azure Data Factory) --> Data Lake --> everything else).
I appreciate any guidance on the subject. Thank you.
You have not mentioned the data size, and I think data size is a very strong factor when deciding whether to move to ADL. In your case the data is very much structured. If you had massive, unstructured data and wanted to use ADB, Hadoop, or any other technology to process it later, I think ADL would be a good candidate.
You should also consider that the data is encrypted in motion using SSL. You can authorize users and groups with fine-grained POSIX-based ACLs for all data in the store, enabling role-based access controls.
The only real value in taking structured data, flattening it, and loading it into a data lake is to save cost and decouple the data from any proprietary tool/compute. In your scenario, it will be less expensive to store the data in a data lake store than in Azure SQL Database.
However, there is a complexity cost to flattening the data. You will need to restructure the data (i.e. load it back into a database, or wrap a logical structure around it) when you need to consume it. Formats such as Parquet will help with this, but it is more complex for users to query data in a data lake than it is to connect to a relational database. Almost all analysts and data consumers will know how to query a relational database, especially if the data is already in SQL Server.
Look at the volume of data and the use cases for consumption to make that decision. A "logical data lake" can include structured data in a relational database, semi-structured data flattened in a storage account, and unstructured data saved to a storage account.

How to perform data factory transformations on large datasets in Azure data warehouse

We have data warehouse tables that we transform using ADF.
If I have a group of ADW tables and need to transform them and land the results back in ADW, should I save the transformed data to Azure Blob Storage first, or go directly to the target table?
The ADW tables have in excess of 100 million records.
Is it acceptable practice to use Blob Storage as the middle piece?
I can think of two ways to do this (neither requires moving the data into Blob Storage):
1. Do the transformation within SQL DW using a stored procedure, and use ADF to orchestrate the stored procedure call.
2. Use ADF's data flow to read from SQL DW, apply the transformation, and write back to SQL DW.
Yes, you had better use Blob Storage as the middle piece.
You cannot copy tables from SQL DW (source) to the same SQL DW (sink) directly! If you try, you will run into these problems:
Copy data: there is an error in the data mapping; the data is copied to the same table, and no new tables are created.
Copy activity: a table is required for the Copy activity.
If you want to copy the data from SQL DW tables to new tables with Data Factory, you need at least two steps:
1. Copy the data from the SQL DW tables to Blob Storage (creating the CSV files).
2. Load these CSV files into SQL DW and create the new tables.
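
The two steps can be sketched as a pipeline with two chained Copy activities. The snippet below only assembles the pipeline definition as a Python dictionary and prints it as JSON; the pipeline, activity, and dataset names are hypothetical, and the property names follow the Data Factory pipeline JSON layout as I understand it, so check them against the reference tutorials before relying on them.

```python
import json


def copy_activity(name, source_dataset, sink_dataset, source_type, sink,
                  depends_on=None):
    """Build one Copy activity; dataset names refer to datasets defined elsewhere."""
    return {
        "name": name,
        "type": "Copy",
        # Run only after every listed activity has succeeded.
        "dependsOn": [
            {"activity": dep, "dependencyConditions": ["Succeeded"]}
            for dep in (depends_on or [])
        ],
        "inputs": [{"referenceName": source_dataset, "type": "DatasetReference"}],
        "outputs": [{"referenceName": sink_dataset, "type": "DatasetReference"}],
        "typeProperties": {"source": {"type": source_type}, "sink": sink},
    }


# Hypothetical names throughout: StageThroughBlob, SourceDwTable, StagingCsv, TargetDwTable.
pipeline = {
    "name": "StageThroughBlob",
    "properties": {
        "activities": [
            # Step 1: export the SQL DW table to CSV files in Blob Storage.
            copy_activity("ExportToBlob", "SourceDwTable", "StagingCsv",
                          "SqlDWSource", {"type": "DelimitedTextSink"}),
            # Step 2: load the CSV files into a new SQL DW table.
            copy_activity("LoadNewTable", "StagingCsv", "TargetDwTable",
                          "DelimitedTextSource",
                          {"type": "SqlDWSink", "allowPolyBase": True},
                          depends_on=["ExportToBlob"]),
        ]
    },
}
print(json.dumps(pipeline, indent=2))
```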
Reference tutorials:
Copy and transform data in Azure Synapse Analytics (formerly Azure SQL Data Warehouse) by using Azure Data Factory
Copy and transform data in Azure Blob storage by using Azure Data Factory
Data Factory is good at transferring big data; see the Copy performance of Data Factory reference. I think it may be faster than the SELECT - INTO clause (Transact-SQL).
Hope this helps.

Reading data from lake

I need to read data from Azure Data Lake, apply some joins in SQL, and show the results in a web UI.
The data is around 300 GB, and migrating it with Azure Data Factory into Azure SQL Database is happening at a speed of 4 Mbps.
I have also tried SQL Server 2019, which has PolyBase support, but that also takes 12-13 hours to copy the data.
I also tried Cosmos DB for storing the data from the lake, but it seems to take a large amount of time.
Is there any other way we can read data from the lake?
One way could be Azure Data Warehouse, but that is too costly and supports only 128 concurrent transactions.
Could Databricks be used? It is a computation engine, though, and we need it to be available 24x7 for UI queries.
I still suggest using Azure Data Factory. As you said, your data is around 300 GB.
Here's the Copy performance and scalability achievable using ADF:
I agree with David Makogon. The performance of your Data Factory is very slow (4 Mbps). Please refer to the document Copy activity performance and scalability guide.
It will help you improve Data Factory copy performance and gives more suggestions about Data Factory and database settings.
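
To put the figures from the question in perspective, here is a quick back-of-the-envelope calculation. It assumes "4 Mbps" means megabits per second and uses decimal units (1 GB = 1000 MB); if the figure was actually megabytes per second, divide the hours by 8. The 12.5 hours is simply the midpoint of the reported 12-13 hours.

```python
def transfer_hours(size_gb: float, rate_mbps: float) -> float:
    """Hours needed to move size_gb gigabytes at rate_mbps megabits per second."""
    megabits = size_gb * 1000 * 8  # decimal units: 1 GB = 1000 MB = 8000 Mb
    return megabits / rate_mbps / 3600


def effective_mbps(size_gb: float, hours: float) -> float:
    """Average rate in megabits per second when size_gb takes the given hours."""
    return size_gb * 1000 * 8 / (hours * 3600)


hours_at_4mbps = transfer_hours(300, 4)
polybase_mbps = effective_mbps(300, 12.5)

print(round(hours_at_4mbps, 1))  # 166.7 hours, i.e. roughly a week
print(round(polybase_mbps, 1))   # 53.3 Mbps for the 12-13 hour PolyBase copy
```

Under these assumptions, both observed rates are very low for a 300 GB copy, which is why tuning the copy settings is worth trying before switching to a different store.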
Hope this helps.
I had a very similar situation, just with more data: roughly 900 GB.
If you need to show it in a UI, you will still need to load the data into Azure SQL, as DWH is not very good at handling parallel load and it's costly.
We ended up using bulk insert from Blob Storage.
I created a stored procedure to call bulk insert with parameters (source file, target table), and used ADF to orchestrate it and run the loads in parallel.
I could not find anything faster than that.
https://learn.microsoft.com/en-us/sql/relational-databases/import-export/examples-of-bulk-access-to-data-in-azure-blob-storage?view=sql-server-ver15
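
A minimal sketch of that pattern follows. It is not the stored procedure or the ADF pipeline from this answer: a Python thread pool stands in for ADF's parallel orchestration, and `run_statement` is a placeholder that returns the statement instead of executing it. The file paths, table name, and the external data source name `MyAzureBlobStorage` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def bulk_insert_sql(source_file: str, target_table: str,
                    data_source: str = "MyAzureBlobStorage") -> str:
    """Build a BULK INSERT statement for one CSV file in Blob Storage.

    Values are interpolated as-is, so they must come from trusted configuration.
    """
    return (
        f"BULK INSERT {target_table} FROM '{source_file}' "
        f"WITH (DATA_SOURCE = '{data_source}', FORMAT = 'CSV', FIRSTROW = 2);"
    )


def run_statement(sql: str) -> str:
    """Placeholder: a real version would execute sql over a database connection."""
    return sql


# Hypothetical (source file, target table) pairs, one per file to load.
loads = [
    ("data/part-001.csv", "dbo.Readings"),
    ("data/part-002.csv", "dbo.Readings"),
    ("data/part-003.csv", "dbo.Readings"),
]

# Run the loads concurrently; pool.map returns results in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    executed = list(pool.map(lambda job: run_statement(bulk_insert_sql(*job)), loads))

for statement in executed:
    print(statement)
```

How many loads can usefully run at once depends on the database tier, so the worker count here is only an illustrative value.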

Benchmark test for polybase with azure data lake

Has anyone performed a benchmark test using PolyBase with ADL? I want to know whether, if I have a data file with 4 million rows, PolyBase will be helpful in fetching those rows into the data warehouse. Can anyone post any articles where I can learn about these things?
Yes, Microsoft has conducted some trials, for example:
Load 1 TB into Azure SQL Data Warehouse under 15 minutes with Data Factory
https://learn.microsoft.com/en-us/azure/data-factory/data-factory-load-sql-data-warehouse
This uses Data Factory, but it's really PolyBase under the hood doing the heavy lifting. Now, it was using PolyBase with Blob Storage (not Data Lake), but you get the idea. As an experiment, why don't you set this up, run it, then convert it to use Data Lake and report back?

Resources