I'm no MS expert - I recently hopped onto the Azure train, so apologies in advance if I get some information wrong.
Basically I need some input on an Azure architecture utilising Azure Data Factory (as the ETL/ELT tool) and Azure SQL Database (as the storage), feeding a BI output - Power BI. My situation is this:
I have on-premises data sources such as Oracle DB, Oracle Cloud SSAS and MS SQL Server DB.
I'd like to have an MS cloud infrastructure solution for reporting purposes.
No data migration is needed - I'm merely pumping on-prem data into the cloud and producing a BI reporting solution.
Based on my limited knowledge and Google research, Azure Data Factory caters for all my on-prem sources, as well as the future cloud Azure SQL Database. If further analysis is needed, Azure Storage and Azure Databricks can be added to this architecture. I have sketched out the architecture of my proposed solution.
Just confirming my understanding:
1. Without Azure Storage & Databricks (the 2 pink boxes), the two Azure components (ADF & SQL Database) are sufficient to take data from the on-premises sources, process it in the cloud and output it to Power BI.
2. With Azure Storage & Databricks (the 2 pink boxes), processing will be more efficient, as their function, in summary, is to store training data/models and act as an analytics processing engine.
3. Azure SQL Database is more suitable than Azure SQL Data Warehouse: my data sources do not exceed 1TB, it is cheaper, AND one of my data sources contains data from call centres, so OLTP is a better fit. Plus I have Azure Databricks to support the analytical bit that SQL Data Warehouse does (OLAP).
Any other comments to help me understand this whole architecture would be great!
Taking data from on-prem or IaaS sources like SQL Server on a VM, Oracle, etc., requires a Self-Hosted Integration Runtime (SHIR).
Please review the Modern Data Warehouse pattern which sounds similar to what you are proposing.
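If it helps, here's a hedged sketch of what the Data Factory piece could look like through the Python SDK (azure-mgmt-datafactory); the linked services and datasets referenced by name are assumed to be defined already (the on-prem one via the SHIR), and all identifiers are placeholders, not anything from your environment:

```python
from azure.identity import ClientSecretCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    CopyActivity, DatasetReference, PipelineResource,
    SqlServerSource, AzureSqlSink,
)

# Authenticate with a service principal (placeholder values).
credential = ClientSecretCredential(
    tenant_id="<tenant-id>", client_id="<client-id>", client_secret="<secret>"
)
adf = DataFactoryManagementClient(credential, "<subscription-id>")

# A single copy activity: on-prem SQL Server (reached through a SHIR-backed
# linked service) -> Azure SQL Database. The datasets "OnPremSqlDataset" and
# "AzureSqlDataset" are assumed to exist already in the factory.
copy = CopyActivity(
    name="CopyOnPremToAzureSql",
    inputs=[DatasetReference(type="DatasetReference", reference_name="OnPremSqlDataset")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="AzureSqlDataset")],
    source=SqlServerSource(),
    sink=AzureSqlSink(),
)

# Publish a pipeline containing just that copy activity.
adf.pipelines.create_or_update(
    "<resource-group>", "<factory-name>", "CopyOnPremPipeline",
    PipelineResource(activities=[copy]),
)
```

The same pipeline can of course be authored in the ADF portal UI; the SDK route is just one way to make the moving parts explicit.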
Unity Catalog is the Azure Databricks data governance solution for the Lakehouse, whereas Microsoft Purview provides a unified data governance solution to help manage and govern your on-premises, multicloud, and software as a service (SaaS) data.
Question: In our same Azure Cloud project, can we use Unity Catalog for the Azure Databricks Lakehouse, and use Microsoft Purview for the rest of our Azure project?
Update: In our current Azure subscription, we have divided the workload as follows:
1. SQL-related workload: we do all our SQL database work using Databricks only (no Azure SQL databases are involved). That is, we use the Databricks Lakehouse, Delta Lake, Databricks SQL etc. to perform ETL and all data analytics work.
2. All non-SQL workload: all other assets (Excel files, CSV files, PDFs, media files etc.) are stored in various Azure storage accounts.
MS Purview does a good job of scanning the assets in scenario 2 above: it easily creates a holistic, up-to-date map of our data landscape with automated data discovery, sensitive-data classification, and end-to-end data lineage. It also gives our data consumers access to valuable, trustworthy data.
However, almost 50% of our work (SQL, ETL, data analytics etc.) is done in Azure Databricks, where we have significant challenges with Purview. We were wondering if it's possible to keep Purview and Unity Catalog separate as follows: Purview does its data governance work for scenario 2 only, and Unity Catalog does its data governance work for scenario 1 only.
This recently released update may resolve our issue of making Purview work better with Azure Databricks but we have not tried it yet: Connect to and manage Azure Databricks in Microsoft Purview (Preview)
As of right now there is no official integration between Unity Catalog and Purview, but one may come in the future. You may join the Azure Databricks roadmap webinar that takes place tomorrow to get more information.
Regarding the actual question - imho, nothing prevents you from using UC & Purview in the same Azure project.
P.S. You can get metadata & lineage information into Purview by loading data from information schema tables and using Purview APIs to store it in Purview.
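For example, a minimal sketch of that approach might look like the following; the account name, entity type and qualified-name convention here are assumptions for illustration, not something prescribed by Purview, and the metadata payload would in practice come from querying the Databricks information schema tables:

```python
import requests
from azure.identity import ClientSecretCredential

# Acquire a token for the Purview data plane (the documented scope is
# "https://purview.azure.net/.default").
cred = ClientSecretCredential(
    tenant_id="<tenant-id>", client_id="<client-id>", client_secret="<secret>"
)
token = cred.get_token("https://purview.azure.net/.default").token

PURVIEW = "https://<your-account>.purview.azure.com"

# Hypothetical table metadata, e.g. harvested from Databricks'
# information_schema.tables over JDBC/ODBC.
payload = {
    "entities": [{
        "typeName": "azure_sql_table",  # pick a type that exists in your Purview type system
        "attributes": {
            "qualifiedName": "databricks://myworkspace/mycatalog/myschema/mytable",
            "name": "mytable",
        },
    }]
}

# Purview exposes the Apache Atlas v2 REST API; /entity/bulk upserts entities.
resp = requests.post(
    f"{PURVIEW}/catalog/api/atlas/v2/entity/bulk",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
resp.raise_for_status()
print(resp.json())
```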
Objective
I'm storing data in Delta Lake format in ADLS gen2. The tables are also available through the Hive catalog.
It's important to note that we're currently using Power BI, but in the future we may switch to Excel over AAS.
Question
What is the best way (or hack) to connect AAS to my ADLS gen2 data in Delta Lake format?
The issue
There is no Databricks/Hive connector among the AAS supported sources. AAS supports ADLS gen2 through the Blob connector, but AFAIK it doesn't support the Delta Lake format, only plain parquet.
Possible solution
From this article I see that the issue may potentially be solved with the Power BI on-premises data gateway:
One example is the integration between Azure Analysis Services (AAS) and Databricks; Power BI has a native connector to Databricks, but this connector hasn't yet made it to AAS. To compensate for this, we had to deploy a Virtual Machine with the Power BI Data Gateway and install Spark drivers in order to make the connection to Databricks from AAS. This wasn't a show stopper, but we'll be happy when AAS has a more native Databricks connection.
The issue with this solution is that we're planning to stop using Power BI. I also don't quite understand how it works, or what Power BI licensing and implementation/maintenance effort it requires. Could you please provide deeper insight into how it would work?
UPD, 26 Dec 2020
Now that Azure Synapse Analytics is GA, it fully supports SQL on-demand. That means serverless Synapse may theoretically be used as glue between AAS and Delta Lake. See "Direct Query Databricks' Delta Lake from Azure Synapse".
At the same time, is it possible to query the Databricks catalog (internal/external) from Synapse on-demand using ODBC? Synapse supports ODBC as an external source.
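For what it's worth, a sketch of that serverless "glue" from Python might look like the following; the endpoint and paths are placeholders, and note that `FORMAT = 'DELTA'` only became available in serverless SQL pools some time after this update was written (earlier you would have to point `FORMAT = 'PARQUET'` at a snapshot or manifest instead):

```python
import pyodbc

# Connect to the Synapse serverless (on-demand) SQL endpoint.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<workspace>-ondemand.sql.azuresynapse.net;"
    "DATABASE=master;UID=<user>;PWD=<password>"
)

# OPENROWSET over the Delta folder in ADLS gen2.
sql = """
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://<account>.dfs.core.windows.net/<container>/delta/mytable/',
    FORMAT = 'DELTA'
) AS rows;
"""

cursor = conn.cursor()
for row in cursor.execute(sql):
    print(row)
```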
Power BI Dataflows now support Parquet files, so you can load from those files into Power BI; however, the standard design pattern is to use Azure SQL Data Warehouse to load the files, then layer Azure Analysis Services (AAS) over that. AAS does not support parquet, so you would have to create a CSV version of the final table, or load it into a SQL database.
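If you go the "CSV version of the final table" route, a minimal Databricks sketch looks like this (the paths are placeholders, and `spark` is provided by the Databricks runtime):

```python
# Read the final Delta table and write a plain CSV snapshot that AAS can
# pick up via the blob/ADLS connector.
delta_path = "abfss://<container>@<account>.dfs.core.windows.net/delta/final_table"
csv_path = "abfss://<container>@<account>.dfs.core.windows.net/export/final_table_csv"

df = spark.read.format("delta").load(delta_path)

(df.coalesce(1)                  # single output file; fine for modest table sizes
   .write.mode("overwrite")
   .option("header", "true")
   .csv(csv_path))
```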
As mentioned, the typical architecture is to have Databricks do some or all of the ETL, then have Azure SQL DW sit over it.
Azure SQL DW has now morphed into Azure Synapse, which has the benefit that a Databricks/Spark database now has a shadow copy accessible through the SQL on Demand functionality. SQL on Demand doesn't require you to have an instance of the data warehouse component of Azure Synapse; it runs on demand, and you pay per TB queried. A good outline of how it can help is here. The other option is to have Azure Synapse load the data from an external table into that service, then connect AAS to that.
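As a rough illustration of that "Databricks does the ETL, the warehouse sits over it" flow, the snippet below uses the Databricks Azure Synapse connector to push a final Delta table into the warehouse; paths, credentials and table names are placeholders, and `spark` is provided by the Databricks runtime:

```python
# Read the curated Delta table produced by the ETL.
df = spark.read.format("delta").load(
    "abfss://<container>@<account>.dfs.core.windows.net/delta/sales"
)

# Write it to Azure Synapse / SQL DW via the built-in connector, which stages
# data in tempDir and loads it with PolyBase/COPY under the hood.
(df.write
   .format("com.databricks.spark.sqldw")
   .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;"
                  "database=<dw>;user=<user>;password=<password>")
   .option("forwardSparkAzureStorageCredentials", "true")
   .option("dbTable", "dbo.Sales")
   .option("tempDir", "abfss://<container>@<account>.dfs.core.windows.net/tempdir")
   .mode("overwrite")
   .save())
```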
Can someone explain the distinct differences between these two products in all major aspects? As far as I can tell from reading the official documents, both can host database systems and provide data-cleaning pipelines, and both run in the cloud?
Databricks:
Azure Databricks is an Apache Spark-based analytics platform optimized for the Microsoft Azure cloud services platform. Designed with the founders of Apache Spark, Databricks is integrated with Azure to provide one-click setup, streamlined workflows, and an interactive workspace that enables collaboration between data scientists, data engineers, and business analysts.
Synapse Analytics:
Azure Synapse is a limitless analytics service that brings together enterprise data warehousing and Big Data analytics. It gives you the freedom to query data on your terms, using either serverless on-demand or provisioned resources—at scale. Azure Synapse brings these two worlds together with a unified experience to ingest, prepare, manage, and serve data for immediate BI and machine learning needs.
They do overlap to some extent, but they are not the same thing. Databricks is pretty much managed Apache Spark, whereas Synapse Analytics is a managed SQL data warehouse.
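To make the distinction concrete, here's a rough sketch of the same aggregation expressed both ways; the table and column names are invented for illustration, and `spark` is provided by the Databricks runtime:

```python
from pyspark.sql import functions as F

# Databricks side: Spark DataFrame API (or spark.sql) over a lake table.
orders = spark.read.format("delta").load("/mnt/lake/orders")
daily = orders.groupBy("order_date").agg(F.sum("amount").alias("total"))
daily.show()

# Synapse side: the equivalent over a warehouse table is plain T-SQL, e.g.
#   SELECT order_date, SUM(amount) AS total
#   FROM dbo.Orders
#   GROUP BY order_date;
```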
I am new to Azure. I would like to learn the architecture deployed in my company, which I have shown below as a diagram. Can anyone point me to a video example or something similar that reflects the diagram below? I also have access to the Azure portal with some money credit, so if possible I could create a test environment based on that diagram.
P.S. Is it possible to use Visual Studio for any kind of work based on that diagram, or does everything have to be created and developed from the Azure portal?
Datasource Oracle DB --> on prem gateway --> ADF--> Azure DB --> AAS --> PowerBI
SQL EDP --------------------------------------^
You've got a fairly straightforward BI architecture there with the following logical components:
raw / source data
integration
data mart / dimensional model
semantic
visualisation
The physical components can be described like this:
Oracle database - former market-leader database product. I would guess your employers have rejected OBIEE for some reason.
Self-hosted Integration Runtime (SHIR) / On-premises data gateway - these enable the movement of data from on-prem data sources to the cloud, and one must be used when moving data from on-prem to Azure SQL DB. Use the SHIR with Data Factory, and the On-premises data gateway with Power BI and Azure Analysis Services.
Data Factory - Azure's ELT tool for moving data from place to place. Its ETL feature, Data Flow, is currently in preview.
Azure SQL DB - PaaS SQL database, scalable via service tiers. If your data in Oracle is not already in a data mart / dimensional format, it can be made so here.
Azure Analysis Services (AAS) - PaaS OLAP in-memory engine, scalable for fast slice-and-dice, drill down and semantic modelling. Tabular only.
Power BI - increasingly powerful visualisation tool. Run dashboards in DirectQuery / LiveConnection mode to avoid entirely duplicating the tabular model from AAS in Power BI.
In answer to some of your questions: you can have one Azure Data Factory with many pipelines. The Visual Studio Azure Data Factory project type is now defunct.
As to "why" for certain technologies:
why Oracle - Who knows.
why SHIR - SHIR is compulsory when moving data from on-prem to cloud with ADF
why Azure SQL DB - lightweight and powerful PaaS DB requiring no infra and offering low TCO; scalable. Might be the location for restructuring data from a raw/relational structure to dimensional, in readiness for the semantic layer, if your data is not already in that format in Oracle
why AAS - fast, in-memory slice-and-dice; scalable, can pause, can be interrogated by Excel, Power BI Desktop, SSMS, VS, other clients etc. Optionally has row-level security (RLS)
why Power BI - the online service PowerBI.com offers easy sharing within the organisation, even externally.
why all the components together - you could (in theory) go straight from Oracle to Power BI with a Power BI gateway (I think), BUT you would then have to do all the modelling in Power BI, and your model would only really be accessible from Power BI. In the proposed model, users with SQL skills can query the data mart, users with DAX (or Excel, or Power BI Desktop) skills can query the AAS tabular model, AAS is a very scalable component, etc.
These opinions are strictly my own personal ones and the value of them may go down, as well as up.
HTH
Azure Data Factory has a 1:M relationship with data sources; one instance of Azure Data Factory supports multiple data movement capabilities: Data movement activities
Information about On-Premise Gateway:
The on-premises data gateway acts as a bridge, providing secure data transfer between on-premises data sources and your Azure Analysis Services servers in the cloud. In addition to working with multiple Azure Analysis Services servers in the same region, the latest version of the gateway also works with Azure Logic Apps, Power BI, Power Apps, and Microsoft Flow. You can associate multiple services in the same subscription and same region with a single gateway.
Connecting to on-premises data sources with Azure On-premises Data Gateway
I am using ADF to connect to sources and land data in Azure Data Lake Store. After getting data into Data Lake Store, I want to do some transformation and aggregation, and use that data in SSRS reports and for creating cubes.
Can anyone suggest which will be the best option (Azure Data Lake Analytics or Azure SQL DW)?
I am looking to make a decision on which one to use after the data lake.
There is no more Azure SQL DW. What we have now is Azure Synapse (the same as Azure SQL DW) and Azure Synapse Analytics (in place of Azure Data Lake Analytics). Microsoft is stopping support for (and development of) U-SQL and Azure Data Lake Analytics. If the volume of your data is huge and you want to use PolyBase technology, the best choice is Azure Synapse and Azure Synapse Analytics. You can enrich your ADF pipelines by using Databricks for the analytics work. By using PolyBase you can do ELT instead of ETL.
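To make the PolyBase-style ELT concrete, here's a hedged sketch; the object names are invented, and the external data source and file format are assumed to have been created beforehand. It exposes lake files as an external table and then loads them with CTAS, so the transformation runs inside the warehouse engine:

```python
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<server>.database.windows.net;DATABASE=<dw>;UID=<user>;PWD=<password>"
)
conn.autocommit = True  # CTAS cannot run inside a user transaction
cur = conn.cursor()

# Step 1: expose the raw lake files as an external table (PolyBase).
cur.execute("""
CREATE EXTERNAL TABLE ext.Sales (
    SaleDate DATE,
    Amount   DECIMAL(18, 2)
)
WITH (
    LOCATION = '/raw/sales/',
    DATA_SOURCE = MyLake,          -- pre-created external data source
    FILE_FORMAT = ParquetFormat    -- pre-created external file format
);
""")

# Step 2: load and transform inside the warehouse with CTAS.
cur.execute("""
CREATE TABLE dbo.Sales
WITH (DISTRIBUTION = HASH(SaleDate))
AS SELECT SaleDate, SUM(Amount) AS Amount
   FROM ext.Sales
   GROUP BY SaleDate;
""")
```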
Microsoft Azure is no longer investing in Azure Data Lake Analytics (ADLA); you can plainly see that the number of enhancements/updates in the last couple of years is close to zero. On the other side, Azure SQL Data Warehouse is their flagship service (recently renamed Azure Synapse Analytics) and hence is being enhanced and updated very quickly. Synapse is based on an MPP architecture and provides all the required capabilities of big data computing.
What is the size of your data? Azure Data Lake is meant more for petabyte-scale big data processing, and Azure SQL Data Warehouse for large relational DWH solutions (starting from 250/500 GB and up).
With Azure Data Lake you can even have the data from the lake feed a NoSQL database, an SSAS cube, a data mart, or go right into Power BI. With Azure SQL Data Warehouse you can have cubes, Power BI reports and SSRS.
If you need SQL Server Reporting Services, Integration Services (and you have complex SSIS logic), and Analysis Services (SSAS), you may be better off considering an Azure SQL VM.