Reading data from lake - azure

I need to read data from Azure Data Lake, apply some joins in SQL, and show the results in a web UI.
The data is around 300 GB, and copying it into Azure SQL Database with Azure Data Factory runs at only about 4 Mbps.
I have also tried SQL Server 2019, which has PolyBase support, but that also takes 12-13 hours to copy the data.
I also tried Cosmos DB for storing the data from the lake, but it seems to take a large amount of time.
Is there any other way we can read data from the lake?
One option is Azure Data Warehouse, but that is too costly and supports only 128 concurrent transactions.
Could Databricks be used? It is a computation engine, though, and we need it to be available 24x7 for UI queries.

I still suggest using Azure Data Factory. As you said, your data is around 300 GB.
The copy performance and scalability achievable with ADF is described in the Copy activity performance and scalability guide.
I agree with David Makogon: the copy performance you are getting from Data Factory is very slow (4 Mbps). Please refer to that guide.
It will help you improve the Data Factory copy performance and gives further suggestions on Data Factory and database settings.
Hope this helps.
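To put the numbers from the question in perspective, here is a rough back-of-the-envelope sketch. It assumes decimal units (1 GB = 1000 MB) and that "4Mbps" really means megabits per second; both function names are made up for illustration.

```python
def transfer_hours(size_gb: float, rate_mbps: float) -> float:
    """Hours needed to move size_gb gigabytes at rate_mbps megabits per second."""
    megabits = size_gb * 1000 * 8  # assumption: 1 GB = 1000 MB, 1 byte = 8 bits
    return megabits / rate_mbps / 3600


def implied_rate_mbps(size_gb: float, hours: float) -> float:
    """Average megabits per second implied by moving size_gb in the given hours."""
    return size_gb * 1000 * 8 / (hours * 3600)


print(f"300 GB at 4 Mbps: {transfer_hours(300, 4):.1f} hours")
print(f"300 GB in 12.5 hours: {implied_rate_mbps(300, 12.5):.1f} Mbps")
```

At 4 Mbps the copy would take roughly 167 hours, while the 12-13 hour PolyBase run corresponds to an average of about 53 Mbps. If the 4 was actually megabytes per second, `transfer_hours(300, 32)` gives about 21 hours instead.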

I had a very similar situation, just with more data (around 900 GB).
If you need to show it in a UI, you will still need to load the data into Azure SQL, as the DWH is not very good at handling parallel load and it is costly.
We ended up using bulk insert from blob storage.
I created a stored procedure that calls bulk insert with parameters (source file, target table) and used ADF to orchestrate it and run the loads in parallel.
I could not find anything faster than that.
https://learn.microsoft.com/en-us/sql/relational-databases/import-export/examples-of-bulk-access-to-data-in-azure-blob-storage?view=sql-server-ver15
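A minimal sketch of the same idea, for illustration only: it is not the stored procedure from this answer, and it swaps ADF for a plain Python thread pool. The external data source name `MyAzureBlobStorage`, the helper names, and the `execute` callback (for example, a wrapper around a database cursor) are all assumptions. It also assumes CSV files with a header row and trusted table names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Tuple


def bulk_insert_sql(source_file: str, target_table: str,
                    data_source: str = "MyAzureBlobStorage") -> str:
    """Build a BULK INSERT statement that reads one CSV file from blob storage."""
    file_literal = source_file.replace("'", "''")  # escape quotes in the path
    return (
        f"BULK INSERT {target_table} "
        f"FROM '{file_literal}' "
        f"WITH (DATA_SOURCE = '{data_source}', FORMAT = 'CSV', FIRSTROW = 2);"
    )


def load_in_parallel(jobs: Iterable[Tuple[str, str]],
                     execute: Callable[[str], None],
                     max_workers: int = 4) -> List[str]:
    """Run one BULK INSERT per (source file, target table) pair concurrently."""
    statements = [bulk_insert_sql(src, table) for src, table in jobs]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() waits for every load and re-raises the first failure, if any
        list(pool.map(execute, statements))
    return statements
```

Passing `print` as `execute` is an easy way to inspect the generated statements before pointing the callback at a real connection.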

Related

Is Azure Synapse Link a good way of loading the data in a Data Warehouse?

Azure Synapse Analytics is the data warehouse solution from Azure.
There are 3 ways to load the data into the warehouse:
COPY statement
PolyBase
Bulk insert
The fastest and most scalable way to load data is through the COPY statement or PolyBase.
However, it is now also possible to load the data through Synapse Link, which provides near-real-time data.
But I do not see any documentation referring to Synapse Link being used in a traditional data warehouse for analytics.
The use cases in the documentation are:
Supply chain analytics, forecasting & reporting
Real-time personalization
Predictive maintenance, anomaly detection in IOT scenarios
These are all use cases that need real-time data.
I do not need near-real-time data. Therefore I assume "Synapse Link" has some disadvantages for a traditional data warehouse solution.
Could someone please share their experience with using "Synapse Link" in a traditional analytics data warehouse?
Thanks in advance
By "traditional data warehouse solution" I assume you mean ETL processes that load or refresh your DWH, say, once a day.
Synapse Link is a very convenient way to import Cosmos DB or Dataverse data into a data lake connected to Synapse. The "real time" part of it shouldn't bother you, because you can always use batch jobs (dataflows) to load the data periodically from the lake into your data warehouse.
With Synapse Link you save time and development effort in bringing the data properly from Cosmos DB or Dataverse into your analytical environment. It works great for us.

How good is Azure Data Lake for storing an SQL database used for Power BI visualizations?

We have an Azure SQL database where we collect a large amount of sensor data. We regularly extract the data from it and transform it a bit with a Python script. The end result is a pandas DataFrame. We would like to store the transformed data in an Azure database and use it as the source for a Power BI dashboard.
On the one hand, we want to show the "almost" real-time data on a dashboard (the latency due to the transformation etc. is acceptable, but the dashboard needs to refresh very frequently, let's say once a minute), but we also want to store the transformed data and query it later e.g. to visualize the data only for a given day.
Is it possible to convert the pandas DataFrame into SQL and store it on Data Lake and stream the data from there? I read that it is possible to store structured data on Data Lake and even query it, but I am unsure if this would be the best solution.
(My current task is to choose the best database for storing the transformed data, to enable both streaming it and querying it later. I am very new to Azure products and I don't have a sandbox account yet to experiment and identify possible pitfalls. I've just figured out that Power BI does not support DirectQuery for Data Lake and I feel like this can be an issue, meaning we would have to query the data on Data Lake first and store it somewhere if we wanted to visualize a subset. Is that correct?)
Azure Data Lake is not a database, just a store for data, both structured and unstructured. So, as mentioned, you can't query it directly unless you have some compute capacity (Databricks, Azure Synapse, Azure Data Lake Analytics, Power BI Premium with enhanced compute).
Depending on your approach, it may be best to move from Azure SQL Database and pandas to Azure Databricks, which can ingest the streaming data, transform it, and produce an output table that is stored in the data lake. You would then connect Power BI to the Databricks instance and query that. The data will only be available while the cluster is running.
Moving to Databricks will involve rewriting your pandas code in Koalas or, preferably, PySpark.
You do have the option of using Databricks to write the items back to an Azure SQL Database table. Depending on what transformations you are doing, you could keep it all in Azure SQL or, if it is streaming sensor data, take the data through Azure Event Hubs to Azure Stream Analytics (which does the transformations) and on to Azure SQL Database (which stores both real-time and historical data).
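As a rough illustration of one table serving both needs (the latest rows for the dashboard, and a single day for later analysis), here is a small sketch. It uses Python's built-in `sqlite3` purely as a stand-in for Azure SQL Database, and the `readings` table, its columns, and the function names are all hypothetical.

```python
import sqlite3
from datetime import date, timedelta


def open_store() -> sqlite3.Connection:
    """Create an in-memory table; a stand-in for the real SQL database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (sensor_id TEXT, recorded_at TEXT, value REAL)")
    conn.execute("CREATE INDEX ix_readings_time ON readings (recorded_at)")
    return conn


def append_batch(conn: sqlite3.Connection, rows) -> None:
    """Append one batch of transformed rows (sensor_id, ISO timestamp, value)."""
    conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)
    conn.commit()


def latest(conn: sqlite3.Connection, limit: int):
    """Most recent rows, e.g. for a frequently refreshed dashboard."""
    return conn.execute(
        "SELECT sensor_id, recorded_at, value FROM readings "
        "ORDER BY recorded_at DESC LIMIT ?", (limit,)).fetchall()


def for_day(conn: sqlite3.Connection, day: str):
    """All rows for one calendar day given as 'YYYY-MM-DD'."""
    start = date.fromisoformat(day)
    end = start + timedelta(days=1)
    # ISO-8601 text timestamps sort chronologically, so a string range works
    return conn.execute(
        "SELECT sensor_id, recorded_at, value FROM readings "
        "WHERE recorded_at >= ? AND recorded_at < ? ORDER BY recorded_at",
        (start.isoformat(), end.isoformat())).fetchall()
```

The point of the design is only that appends, "latest" reads, and per-day reads can all hit the same indexed timestamp column; a real deployment would use a proper datetime type and the database's own driver.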

Benchmark test for polybase with azure data lake

Has anyone performed a benchmark test using PolyBase with ADL? I want to know whether, if I have a data file with 4 million rows, PolyBase will be helpful in fetching those rows into the data warehouse. Can anyone post any articles where I can learn about these things?
Yes, Microsoft has conducted some trials, for example:
Load 1 TB into Azure SQL Data Warehouse under 15 minutes with Data Factory
https://learn.microsoft.com/en-us/azure/data-factory/data-factory-load-sql-data-warehouse
This is using Data Factory but it's really Polybase under the hood doing the heavy lifting. Now, it was using Polybase with Blob Storage (not Data Lake) but you get the idea. As an experiment, why don't you set this up, run it, then convert it to use Data Lake and report back?

Why is Polybase slow for large compressed files that span 1 billion records?

What would cause Polybase performance to degrade when querying larger datasets in order to insert records into Azure Data Warehouse from Blob storage?
For example, a few thousand compressed (.gz) CSV files with headers, partitioned by a few hours per day across 6 months' worth of data. Querying these files from an external table in SSMS is not exactly optimal and it's extremely slow.
My objective is to load data through PolyBase in order to transfer it into Azure Data Warehouse. However, it seems that with large datasets PolyBase is pretty slow.
What options are available to optimize Polybase here? Wait out the query or load the data after each upload to blob storage incrementally?
In your scenario, Polybase has to connect to the files in the external source, uncompress them, then ensure they fit your external table definition (schema) and then allow the contents to be targeted by the query. When you are processing large amounts of text files in a one-off import fashion, there is nothing to really cache either, since it is dealing with new content every time. In short, your scenario is compute heavy.
Azure Blob Storage will (currently) max out at around 1,250 MB/sec, so if your throughput is nowhere near that limit, the best way to improve performance is to increase the DWU of your SQL Data Warehouse. In the background, this spreads your workload over a bigger cluster (more servers). SQL Data Warehouse DWU can be scaled up or down in a matter of minutes.
If you have huge volumes and are maxing the storage, then use multiple storage accounts to spread the load.
Other alternatives include relieving Polybase of the unzip work as part of your upload or staging process. Do this from within Azure where the network bandwidth within a data center is lightning fast.
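As a sketch of what "relieving PolyBase of the unzip work" could look like in a staging step, the snippet below decompresses a .gz file to a staging folder using only the standard library. It assumes the files are reachable as local paths (for example, on a VM inside Azure); uploading the result back to blob storage is not shown, and the function name is made up.

```python
import gzip
import shutil
from pathlib import Path


def decompress_gz(src: Path, dest_dir: Path) -> Path:
    """Write an uncompressed copy of src (e.g. part.csv.gz) into dest_dir."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / src.stem  # 'part.csv.gz' -> 'part.csv'
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)  # streams in chunks, so large files are fine
    return dest
```

Whether this pays off depends on the trade between extra storage and transfer volume for uncompressed files and the compute saved at load time, so it is worth measuring on a sample first.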
You could also consider using Azure Data Factory to do the work. See its documentation for the supported file formats; GZip is supported. Use the Copy Activity to copy from Blob storage into SQL DW.
Also look into:
CTAS (Create Table As Select), the fastest way to move data from external tables into internal storage in Azure Data Warehouse.
Creating statistics for your external tables if you are going to query them repeatedly. SQL Data Warehouse does not create statistics automatically like SQL Server does, so you need to do this yourself.
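The two suggestions above can be sketched as a pair of small statement builders. This is an illustration under assumptions: the table names in the usage note, the `stat_` naming scheme, the `ROUND_ROBIN` default, and the helper names are all hypothetical, and the inputs are assumed to be trusted identifiers.

```python
from typing import List


def ctas_sql(target_table: str, external_table: str,
             distribution: str = "ROUND_ROBIN") -> str:
    """Build a CTAS statement that copies an external table into internal storage."""
    return (
        f"CREATE TABLE {target_table}\n"
        f"WITH (DISTRIBUTION = {distribution}, CLUSTERED COLUMNSTORE INDEX)\n"
        f"AS SELECT * FROM {external_table};"
    )


def stats_sql(table: str, columns: List[str]) -> List[str]:
    """Build one single-column CREATE STATISTICS statement per listed column."""
    prefix = "stat_" + table.replace(".", "_")
    return [f"CREATE STATISTICS {prefix}_{col} ON {table} ({col});" for col in columns]
```

For example, `ctas_sql("dbo.SensorFacts", "ext.SensorFiles")` yields the load statement, and `stats_sql("ext.SensorFiles", ["event_time"])` yields the statistics statement for a column you filter or join on.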

Load data to Azure SQL DW

I have a large amount of data to be loaded into SQL DW. What is the best way to get the data to Azure? Should I use Import/Export or AzCopy? How long would each method take?
The process of loading data depends on the amount of data. For very small data sets (<100 GB) you can simply use the bulk copy command line utility (bcp.exe) to export the data from SQL Server and then import to Azure SQL Data Warehouse.
For data sets greater than 100 GB, you can export your data using bcp.exe, move the data to Azure Blob Storage using a tool like AzCopy, create an external table (via T-SQL code) and then pull the data in via a Create Table As Select (CTAS) statement. This works well up to a TB or two, depending on your connectivity to the cloud.
For really large data sets, say greater than a couple of TBs, you can use the Azure Import/Export service to move the data into Azure Blob Storage and then load the data with PolyBase/CTAS.
Using the PolyBase/CTAS route will allow you to take advantage of multiple compute nodes and the parallel nature of data processing in Azure SQL Data Warehouse - an MPP based system. This will greatly improve the data ingestion performance as each compute node is able to process a block of data in parallel with the other nodes.
One consideration as well is to increase the amount of DWU (compute resources) available in SQL Data Warehouse at the time of the CTAS statement. This will increase the number of compute resources adding additional parallelism which will decrease the total ingestion time.
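The size-based guidance above can be summarised as a small lookup. This is only a sketch: the 100 GB cutoff comes from the answer, while treating "a TB or two" as 2000 GB is an assumption, and the function and constant names are made up.

```python
from typing import List

SMALL_LIMIT_GB = 100     # "very small data sets" cutoff from the answer
NETWORK_LIMIT_GB = 2000  # assumption: "a TB or two" read as 2 TB


def choose_load_path(size_gb: float) -> List[str]:
    """Return the suggested loading steps for a data set of the given size."""
    if size_gb < SMALL_LIMIT_GB:
        return [
            "export from SQL Server with bcp.exe",
            "import into SQL Data Warehouse with bcp.exe",
        ]
    if size_gb <= NETWORK_LIMIT_GB:
        return [
            "export from SQL Server with bcp.exe",
            "copy files to Azure Blob Storage with AzCopy",
            "create an external table over the files",
            "load with CTAS",
        ]
    return [
        "move data to Azure Blob Storage with the Azure Import/Export service",
        "load with PolyBase/CTAS",
    ]


for size in (50, 300, 5000):
    print(size, "GB:", " -> ".join(choose_load_path(size)))
```

The real boundary between the second and third options depends on your connectivity to the cloud, so adjust `NETWORK_LIMIT_GB` to what your link can actually sustain.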
You can go through the documentation below and figure out which option suits you best.
https://azure.microsoft.com/en-us/documentation/articles/sql-data-warehouse-overview-load/
If you already have data in an on-premises SQL Server, you can use the migration wizard tool to load that data into Azure SQL DB.
http://sqlazuremw.codeplex.com/

Resources