Optimizing Azure Analysis Services processing on Synapse Analytics

I have a massive model with billions of records in Synapse Analytics that needs to be loaded into Azure Analysis Services. At present the tables are partitioned by month, the data types and joins are correct, and we have reduced the tables to ONLY the columns we need. We have scaled Synapse to DWU 5000c and Azure Analysis Services to S9v2 (the maximum setting). Even so, the first full process still takes many hours, with up to 64 partitions running in parallel given the concurrency limits in Synapse.
I am curious: is there a setting that enables a 'mass bulk import' style process-full that I am missing at this point?
For example, here: https://www.businessintelligenceinfo.com/business-intelligence/self-service-bi/building-an-azure-analysis-services-model-on-top-of-azure-blob-storage-part-3, the author loads differently, from storage blobs, and was able to load 1 TB by changing the call to ReadBlobData(Source, "", 1, 50). Is there any similar configuration for loading from Synapse into Analysis Services?

Related

Moving Data from Azure Blob Storage to Azure Synapse (SQL dedicated pools)

I have a requirement to move Azure Blob Storage data to Azure Synapse (SQL dedicated pools).
The Azure Blob Storage container has around 850 GB of data (in the form of multiple JSON files). I have created a Synapse pipeline and used PolyBase to move the data from blob storage to the SQL dedicated pool. PolyBase requires a staging environment, for which I have used a staging blob container:
Azure Blob Storage -> staging container -> SQL dedicated pool (Azure Synapse)
I have not put any restrictions on DIUs or parallel copies, so the pipeline uses 32 DIUs and the parallel copy count goes up to 120-130.
The first stage completed in 5 hours, moving 850 GB of data to the staging container, but the second stage has now been running for 15 hours and is still not finished; the DIU count I can see is 2 and the parallel copy count is 1.
Do I need to explicitly specify the DIUs and parallel copies?
Is there a better way to do this other than PolyBase?
There are 3 key points missing from your question:
Why are you using Staging Table?
Do you need to perform any transformation on the data?
Why is your staging table in Blob Storage?
As per the best approach:
It is best practice to load data into a staging table. Staging tables allow you to handle errors without interfering with the production tables. A staging table also gives you the opportunity to use SQL pool built-in distributed query processing capabilities for data transformations before inserting the data into production tables.
So, as you haven't mentioned whether your approach is Extract, Transform and Load (ETL) or Extract, Load and Transform (ELT), the reason for the staging table isn't clear. If your approach is ETL, staging is useful; otherwise it is not required.
Secondly, staging tables should be in the SQL pool, not in blob storage.
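For illustration, a staging table inside the dedicated SQL pool might look like the sketch below; the schema and column names are made up, and a round-robin heap is a common choice because it keeps the load fast while deferring indexing and hash distribution to the production table.

    -- Hypothetical staging table created inside the dedicated SQL pool (schema and names are placeholders).
    -- A round-robin heap keeps the PolyBase load fast; the production table can use a
    -- clustered columnstore index and hash distribution instead.
    CREATE TABLE stg.SalesRaw
    (
        SaleId    BIGINT         NOT NULL,
        SaleDate  DATE           NOT NULL,
        Amount    DECIMAL(18, 2) NULL
    )
    WITH
    (
        HEAP,
        DISTRIBUTION = ROUND_ROBIN
    );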
Before loading the data into the staging table, you need to define external tables in your data warehouse. PolyBase uses external tables to define and access the data in Azure Storage. An external table is similar to a database view: it contains the table schema and points to data that is stored outside the data warehouse.
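As a rough sketch (the storage account, container, file format, and column names below are assumptions, not taken from your question), the external objects could look like this:

    -- Hypothetical external data source pointing at the blob container (credential omitted; names are placeholders).
    CREATE EXTERNAL DATA SOURCE AzureBlobSource
    WITH
    (
        TYPE = HADOOP,
        LOCATION = 'wasbs://mycontainer@mystorageaccount.blob.core.windows.net'
    );

    -- PolyBase reads delimited text, ORC, or Parquet; JSON files would first need to be
    -- flattened or read as single-column text.
    CREATE EXTERNAL FILE FORMAT CsvFormat
    WITH
    (
        FORMAT_TYPE = DELIMITEDTEXT,
        FORMAT_OPTIONS (FIELD_TERMINATOR = ',', FIRST_ROW = 2)
    );

    -- The external table only stores the schema and a pointer to the files.
    CREATE EXTERNAL TABLE ext.SalesRaw
    (
        SaleId    BIGINT,
        SaleDate  DATE,
        Amount    DECIMAL(18, 2)
    )
    WITH
    (
        LOCATION = '/sales/',
        DATA_SOURCE = AzureBlobSource,
        FILE_FORMAT = CsvFormat
    );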
You also need to take care of the resource class and the distribution method. Refer to this third-party article to learn more about these two workload-management concepts.
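For example, a dedicated load user can be placed in a larger static resource class so each load query gets more memory; the user and class names below are placeholders:

    -- Hypothetical: give the load user a larger static resource class (run as an admin).
    EXEC sp_addrolemember 'staticrc60', 'LoadUser';

    -- The distribution method is chosen per table via the WITH clause,
    -- e.g. DISTRIBUTION = HASH(SaleId) on the production table and
    -- DISTRIBUTION = ROUND_ROBIN on the staging table shown above.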
Since there are many parameters to consider before you can find out what exactly the issue is, I suggest you first go through the key official documents and then make appropriate changes to your architecture.
Helpful links:
Design a PolyBase data loading strategy for dedicated SQL pool in Azure Synapse Analytics
Workload management with resource classes in Azure Synapse Analytics
Best practices for loading data into a dedicated SQL pool in Azure Synapse Analytics

Batch processing in Azure

We are planning to do batch processing on a daily basis. We generate 1 GB of CSV files every day and will manually put them into Azure Data Lake Store. I have read the Microsoft Azure documentation on batch processing and have decided to use Spark for it. My question is: after we transform the data using RDDs/DataFrames, what would be the next step? How can we visualize the data? Since this process is supposed to run every day, once the data transformation is done with Spark, do we need to push the data to some kind of data store like Hive, HDFS, or Cosmos DB before we can visualize it?
There are several options for doing this on Azure. It really depends on your requirements (e.g. number of users, needed visualizations, etc.). Some examples:
Running Spark on Azure Databricks, you could use the Notebook capabilities to visualize your data
Use HDInsight with Jupyter or Zeppelin Notebooks
Define Spark tables on Azure Databricks and visualize them with Power BI
Load the data with Azure Data Factory V2 to Azure SQL DB or Azure SQL Data Warehouse and visualize it with Power BI.
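To make the last option concrete: once Data Factory has landed the transformed daily output in Azure SQL DB or Azure SQL Data Warehouse, Power BI can simply point at a view. The table and column names below are invented for illustration only.

    -- Hypothetical reporting view over the daily batch output (all names are assumptions).
    CREATE VIEW dbo.vw_DailySummary
    AS
    SELECT
        CAST(EventTime AS DATE) AS EventDate,
        COUNT(*)                AS EventCount,
        SUM(Amount)             AS TotalAmount
    FROM dbo.DailyEvents
    GROUP BY CAST(EventTime AS DATE);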
For time-series data you could push the data via Spark to Azure Event Hubs (see the example notebook with an Event Hubs sink in the linked documentation) and consume it via Azure Time Series Insights. If you have an event data stream, this could also replace your batch-oriented architecture in the future. Azure Time Series Insights uses Parquet files as its long-term storage (see the following link). For Spark, also have a look at the Time Series package, which adds some time-series capabilities to Spark.

Migrate data from on-prem DW to Azure

I have an on-prem data warehouse using SQL Server; what is the best way to load the data into SQL Data Warehouse?
The process of loading data depends on the amount of data. For very small data sets (<100 GB) you can simply use the bulk copy command line utility (bcp.exe) to export the data from SQL Server and then import to Azure SQL Data Warehouse. For data sets greater than 100 GB, you can export your data using bcp.exe, move the data to Azure Blob Storage using a tool like AzCopy, create an external table (via TSQL code) and then pull the data in via a Create Table As Select (CTAS) statement.
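A minimal sketch of that last step, assuming an external table named ext.FactSalesStage has already been defined over the uploaded files (all names here are placeholders):

    -- Hypothetical CTAS that pulls the external (blob-backed) data into a distributed table.
    CREATE TABLE dbo.FactSales
    WITH
    (
        CLUSTERED COLUMNSTORE INDEX,
        DISTRIBUTION = HASH(SaleId)
    )
    AS
    SELECT *
    FROM ext.FactSalesStage;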
Using the PolyBase/CTAS route will allow you to take advantage of multiple compute nodes and the parallel nature of data processing in Azure SQL Data Warehouse - an MPP based system. This will greatly improve the data ingestion performance as each compute node is able to process a block of data in parallel with the other nodes.
Another consideration is to increase the amount of DWU (compute resources) available in SQL Data Warehouse at the time of the CTAS statement. This increases the number of compute resources, adding parallelism and decreasing the total ingestion time.
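For instance, scaling up just before the load and back down afterwards can be done with a single statement; the database name and service objective below are placeholders, and the statement is run against the master database.

    -- Hypothetical scale-up of the data warehouse before running the CTAS load.
    ALTER DATABASE MyDataWarehouse
    MODIFY (SERVICE_OBJECTIVE = 'DW1000c');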
The SQL Database Migration Wizard is a helpful tool for migrating schema and data from an on-premises database to Azure SQL databases.
http://sqlazuremw.codeplex.com/

Load data to Azure SQL DW

I have a large amount of data to load into SQL DW. What is the best way to get the data to Azure? Should I use Import/Export or AzCopy? How long would each method take?
The process of loading data depends on the amount of data. For very small data sets (<100 GB) you can simply use the bulk copy command line utility (bcp.exe) to export the data from SQL Server and then import to Azure SQL Data Warehouse.
For data sets greater than 100 GB, you can export your data using bcp.exe, move the data to Azure Blob Storage using a tool like AzCopy, create an external table (via T-SQL code) and then pull the data in via a Create Table As Select (CTAS) statement. This works well up to a TB or two, depending on your connectivity to the cloud.
For really large data sets, say greater than a couple of TBs, you can use the Azure Import/Export service to move the data into Azure Blob Storage and then load the data with PolyBase/CTAS.
Using the PolyBase/CTAS route will allow you to take advantage of multiple compute nodes and the parallel nature of data processing in Azure SQL Data Warehouse - an MPP based system. This will greatly improve the data ingestion performance as each compute node is able to process a block of data in parallel with the other nodes.
Another consideration is to increase the amount of DWU (compute resources) available in SQL Data Warehouse at the time of the CTAS statement. This increases the number of compute resources, adding parallelism and decreasing the total ingestion time.
You can go through the documentation below and figure out which option suits you best.
https://azure.microsoft.com/en-us/documentation/articles/sql-data-warehouse-overview-load/
If you already have data in an on-premises SQL Server, you can use the migration wizard tool to load that data into Azure SQL DB.
http://sqlazuremw.codeplex.com/

How well does Azure SQL Data Warehouse scale?

I want to replace all my on-prem DWs on SQL Server and use Azure SQL DW. My plan is to remove the hub and spoke model that I currently use on-prem and basically have one large Azure SQL DW instance that scales with my client base (currently ~1000 clients). Would SQL DW scale, or do I need to retain my hub and spoke model?
Azure SQL Data Warehouse is a great choice for removing your hub and spoke model. The service allows you to scale storage and compute independently to meet the compute and storage needs of your data warehouse. For example, you may have a large number of data sets that are infrequently accessed. SQL Data Warehouse allows you to keep a small number of compute resources (to save costs) and, using SQL Server features like table partitioning, efficiently access only the data in the "hot" partitions - say, the last 6 months.
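A rough sketch of such a table, with monthly partitions so that queries only touch the recent "hot" ranges; the table, columns, and boundary dates are all made-up examples:

    -- Hypothetical fact table partitioned by month (names and boundaries are placeholders).
    CREATE TABLE dbo.FactOrders
    (
        OrderId   BIGINT         NOT NULL,
        OrderDate DATE           NOT NULL,
        Amount    DECIMAL(18, 2) NULL
    )
    WITH
    (
        CLUSTERED COLUMNSTORE INDEX,
        DISTRIBUTION = HASH(OrderId),
        PARTITION (OrderDate RANGE RIGHT FOR VALUES
            ('2016-01-01', '2016-02-01', '2016-03-01',
             '2016-04-01', '2016-05-01', '2016-06-01'))
    );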
The service offers the ability to adjust the compute power by moving a slider up or down - with the compute change happening in about 60 seconds (during the preview). This allows you to start small - say with a single spoke - and add over time making the migration to the cloud easy. As you need more power, you can simply add DWU/Compute resources by moving the slider to the right.
As the compute model scales, the number of query and loading concurrency slots increases, giving you the ability to support larger numbers of customers. You can read more about elastic performance and scale with SQL Data Warehouse on Azure.com.
