Is Azure Synapse Link a good way of loading data into a Data Warehouse?

Azure Synapse Analytics is Azure's data warehouse solution.
There are three ways to load data into the warehouse:
COPY statement
PolyBase
Bulk insert
The fastest and most scalable ways to load data are the COPY statement and PolyBase.
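For illustration, a COPY load into a dedicated SQL pool looks roughly like the sketch below; the storage account, path, and table name are placeholders, not part of the original question.

-- Minimal sketch of a COPY statement load into a dedicated SQL pool;
-- the storage account, container, path, and table name are placeholders.
COPY INTO dbo.FactSales
FROM 'https://myaccount.blob.core.windows.net/exports/sales/*.parquet'
WITH (
    FILE_TYPE = 'PARQUET',
    CREDENTIAL = (IDENTITY = 'Managed Identity')
);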
However, it is now also possible to load data through Synapse Link, which provides near-real-time data.
But I do not see any documentation referring to Synapse Link being used in a traditional data warehouse for analytics.
The use cases in the documentation are:
Supply chain analytics, forecasting & reporting
Real-time personalization
Predictive maintenance, anomaly detection in IoT scenarios
These are use cases that need real-time data.
I do not need near-real-time data. Therefore I assume "Synapse Link" has some disadvantages for a traditional data warehouse solution.
Could someone please share their knowledge about using "Synapse Link" in a traditional analytics data warehouse?
Thanks in advance

With "Traditional datawarehouse solution" I assume you have ETL processes, that load/refresh your DWH say once a day.
The Synapse Link is a very convenient way to import Cosmos DB or Dataverse Data into a Data Lake connected to Synapse. The "real time" part of it shouldn't bother you, because you can always use batch jobs (dataflows) to load the data periodically from the lake into your datawarehouse.
With the Synapse Link you save time and development effort to bring the data properly from the Cosmos DB or Dataverse into your analytical environment. It works great for us.
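As a minimal sketch (assuming a Cosmos DB analytical store enabled via Synapse Link; the account, database, key, and container names are placeholders), a Synapse serverless SQL pool can read the linked data directly:

-- Minimal sketch: read the Cosmos DB analytical store exposed by Synapse Link
-- from a Synapse serverless SQL pool; account, database, key, and container
-- names are placeholders.
SELECT TOP 100 *
FROM OPENROWSET(
    'CosmosDB',
    'Account=myaccount;Database=mydb;Key=<account-key>',
    Orders
) AS rows;

A nightly pipeline can then persist the result of such a query into the dedicated pool, so the near-real-time capability costs you nothing.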

Related

Is Azure Synapse a good choice for Time Series Data?

We are in the process of analyzing which database will be the best choice for time series data (like stock market data/trading data, market sentiment, etc.).
Is Azure Synapse a good choice for time series data?
Azure Synapse data explorer (Preview) provides you with a dedicated query engine optimized and built for log and time series data workloads.
With this new capability now part of Azure Synapse's unified analytics platform, you can easily access your machine and user data to surface insights that can directly improve business decisions.
To complement the existing SQL and Apache Spark analytical runtimes, Azure Synapse data explorer is optimized for efficient log analytics, using powerful indexing technology to automatically index structured, semi-structured, and free-text data commonly found in telemetry data.
For more info, please refer to the related articles below:
https://learn.microsoft.com/en-us/azure/synapse-analytics/data-explorer/data-explorer-overview
Time series solution - Azure Architecture
Please note that the feature is in public preview.

Why would I not use Databricks as my data mart?

I'm trying to get my head around Databricks.
I've found documentation stepping through importing data from S3 or Azure Data Lake, and then outputting into Azure Synapse Analytics or another data warehouse solution.
After a quick play, I've recognised that you can simply save a table in Databricks, access it using SQL, and even pull it into Power BI as a source.
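Something like the following sketch, for example (the schema and table names here are hypothetical):

-- Minimal sketch of a small mart table saved in Databricks as a Delta table;
-- the schema and table names are hypothetical.
CREATE TABLE IF NOT EXISTS mart.dim_customer (
    customer_id   BIGINT,
    customer_name STRING,
    country       STRING
) USING DELTA;

-- A reporting tool can then query it directly over the SQL endpoint:
SELECT country, COUNT(*) AS customers
FROM mart.dim_customer
GROUP BY country;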
So my question: for a small data mart (10 dims, 5 facts), why would I choose to pay for an additional database solution like Azure SQL, Synapse, RDS, or another when I could simply leave the data in a table in Databricks and access it directly from my reporting tool?
Thank you in advance.
Andy
Yes, this is very much possible. Just to let you know: Azure SQL and Synapse may both be Microsoft offerings, but they serve different purposes. Synapse supports MPP, so it is more of a big-data implementation. Also, it's not only how many dimension and fact tables you have; how much data you have, what kind of aggregations you need, etc., also become decisive.

Azure Data Explorer (ADX) vs Polybase vs Databricks

Question
Today I discovered another Azure service called Azure Data Explorer (ADX). Sorry for such a comparison of services; I have a good understanding of all of them except ADX. I feel like there is a big overlap in functionality, so I want to know the exact role of ADX in the Azure infrastructure.
What is the use case when ADX is significantly better than Synapse/Databricks?
My understanding of ADX
AFAIK, ADX is a cluster (with per-hour billing, like Databricks or Synapse, not like ADLA) that handles the database for you and is optimized for streaming ingestion and ad-hoc queries at scale. It also supports external tables, which have worse performance but are cheaper (you pay only for Blob/ADLS storage).
Details
I don't understand why we need ADX if:
Azure Synapse has a similar pricing model (cluster, per-hour), and it also supports streaming ingestion and ad-hoc querying at scale. Azure Synapse supports querying Blob Storage/ADLS through PolyBase external tables (see the sketch after this list).
Databricks is another service that is capable of this. Using Databricks Ingest and Delta Lake, you can ingest streaming data and consume it in both streaming and batch fashion. You can even have an interactive cluster that handles ad-hoc queries for you.
Also, if you want real-time analytics, use Azure Stream Analytics. If you want an Athena-like experience, use ADLA (though it still doesn't support ADLS Gen2).
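For reference, an external table over the lake looks roughly like this minimal sketch; the data source, file format, path, and column names are placeholders, and credential setup is omitted:

-- Minimal sketch of an external table over ADLS in Synapse; the data source,
-- file format, path, and column names are placeholders (credential setup omitted).
CREATE EXTERNAL DATA SOURCE lake_ds
WITH (LOCATION = 'abfss://data@myaccount.dfs.core.windows.net');

CREATE EXTERNAL FILE FORMAT parquet_ff
WITH (FORMAT_TYPE = PARQUET);

CREATE EXTERNAL TABLE dbo.ext_events (
    event_id   BIGINT,
    event_time DATETIME2,
    payload    NVARCHAR(4000)
)
WITH (LOCATION = '/events/', DATA_SOURCE = lake_ds, FILE_FORMAT = parquet_ff);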
Azure Data Explorer is focused on high-velocity, high-volume, high-variance data (the three Vs of big data). It provides super-fast interactive queries over such data as it streams in. It supports JSON and text natively, including full-text search and indexing.
It is used in a broad set of scenarios associated with sensing activity and time series in a large set of verticals: IoT, API logs, transaction monitoring and ad hoc data exploration.
Microsoft offers ADX as a service because it is the major service Microsoft uses for its own telemetry, and all the analytical solutions we offer as a service in security, operational monitoring, game analytics, product insights/usage analytics, IoT, and connected vehicles are built on ADX. You can find a full list in our docs. For clarity: SQL, Synapse, and Cosmos DB store their telemetry in Azure Data Explorer.
SQL DW (a.k.a. Synapse SQL pool) is an excellent data warehouse and implements the modern data warehouse pattern: ETL -> curated data model -> load and serve via Analysis Services or Power BI.
ADX is for real-time analytics, enabling schema-on-read (SOR) over data as fresh as seconds old.
Consider ADX a fully managed platform when replacing Solr/Lucene-based variants used for logs, time series databases, and more.
Try it out on large workloads and you will see it is dramatically cheaper than the alternatives and much more powerful and performant.
Reach out to me if you need help.
Azure Data Explorer, alias Kusto, is focused on high-volume data ingestion and near-real-time query and analytics. It was invented at Microsoft for log and telemetry analytics, but can be used for other purposes, e.g. IoT, sensor data, or web analytics. The same technology is used in internal Azure services like Azure Monitor and Log Analytics.
Similar capabilities could be built on Synapse, Databricks, or HDInsight, but I see those as tools that fit much broader use cases. ADX has quite a narrow focus. ADX does support queries ("KQL") but has very limited SQL support. It is good for append-only data, not for updates. It is not a data warehouse, database, or data lake.
Microsoft material refers to the technology behind ADX by the name Kusto. More info on this at https://learn.microsoft.com/en-us/azure/data-explorer/kusto/concepts/. A good comparison of services can be found in this blog post: https://vincentlauzon.com/2020/02/19/azure-data-explorer-kusto

Azure Data Lake for Structured Data

We've been reviewing the Modern Data Warehouse architectures from Microsoft (link here), which reference using Azure Data Factory to pull structured and unstructured data into the Azure Data Lake. I've attended a lot of presentations on the subject as well, but most people are split on whether the Data Lake is a good home for structured data. What I am trying to determine is whether importing data into the Data Lake is a good strategy when the only sources we will be utilizing are on-prem SQL Server databases, and what the advantages/disadvantages of that strategy would be.
For context's sake, we're looking for a single pane of glass for consumption, whether it's end users reporting with Power BI, or fodder for Azure Data Warehouse/an on-prem data warehouse. We want one container that is the source for all of these systems, which is not the source OLTP system (i.e. OLTP database --> (Azure Data Factory) --> Data Lake --> everything else).
I appreciate any guidance on the subject. Thank you.
You have not mentioned the data size, and I think for moving to ADL the data size is a very strong parameter. In your case the data is very much structured. If you had unstructured and massive data and wanted to use ADB, Hadoop, or any other technology to process it later, I think ADL would be a good candidate.
You should also consider that the data is encrypted in motion using SSL. You can authorize users and groups with fine-grained POSIX-based ACLs for all data in the store, enabling role-based access controls.
The only real value in taking structured data, flattening it, and loading it into a data lake is to save cost and decouple the data from any proprietary tool/compute. In your scenario, it will be less expensive to store the data in a data lake store than in Azure SQL Database.
However, there is a complexity cost to flattening the data. You will need to restructure the data (i.e. load it back into a database, or wrap a logical structure around it) when you need to consume it. Formats such as Parquet will help with this, but it is more complex for users to query data in a data lake than it is to connect to a relational database. Almost all analysts and data consumers will know how to query a relational database, especially if the data is already in SQL Server.
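To make the Parquet point concrete, here is a minimal sketch of an ad-hoc query over lake files, e.g. from a Synapse serverless SQL pool (the storage path is a placeholder):

-- Minimal sketch: ad-hoc query over Parquet files in the lake from a
-- Synapse serverless SQL pool; the storage path is a placeholder.
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://myaccount.dfs.core.windows.net/data/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS rows;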
Look at the volume of data and the use cases for consumption to make that decision. A "logical data lake" can include structured data in a relational database, semi-structured data flattened in a storage account, and unstructured data saved to a storage account.

Reading data from lake

I need to read data from Azure Data Lake, apply some joins in SQL, and show the result in a web UI.
The data is around 300 GB, and migrating it from Azure Data Factory to Azure SQL Database is happening at a speed of 4 Mbps.
I have also tried SQL Server 2019, which has PolyBase support, but that is also taking 12-13 hours to copy the data.
I also tried Cosmos DB for storing the data from the lake, but it seems to take a large amount of time as well.
Is there any other way we can read data from the lake?
One option could be Azure Data Warehouse, but that is too costly and supports only 128 concurrent transactions.
Could Databricks be used? But it's a computation engine, and we need it to be available 24x7 for UI queries.
I still suggest using Azure Data Factory. As you said, your data is around 300 GB.
I agree with David Makogon that the performance of your Data Factory copy is very slow (4 Mbps). Please reference the document Copy activity performance and scalability guide; it shows the copy performance and scalability achievable using ADF and gives more suggestions about Data Factory and database settings.
Hope this helps.
I had a very similar situation, just with more data, around 900 GB.
If you need to show it in a UI, you will still need to load the data into Azure SQL, as the DWH is not very good at handling a parallel load, and it's costly.
We ended up using BULK INSERT from Blob Storage.
I created a stored procedure to call BULK INSERT with parameters (source file, target table) and used ADF to orchestrate it and run it in parallel.
I could not find anything faster than that.
https://learn.microsoft.com/en-us/sql/relational-databases/import-export/examples-of-bulk-access-to-data-in-azure-blob-storage?view=sql-server-ver15
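The pattern from that document looks roughly like the sketch below; the account, container, file, and table names plus the SAS token are placeholders, and a database master key must already exist for the scoped credential:

-- Minimal sketch of BULK INSERT from Blob Storage; account, container, file,
-- and table names plus the SAS token are placeholders (a database master key
-- must already exist for the scoped credential).
CREATE DATABASE SCOPED CREDENTIAL blob_cred
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
     SECRET = '<sas-token-without-leading-?>';

CREATE EXTERNAL DATA SOURCE blob_ds
WITH (TYPE = BLOB_STORAGE,
      LOCATION = 'https://myaccount.blob.core.windows.net/mycontainer',
      CREDENTIAL = blob_cred);

BULK INSERT dbo.TargetTable
FROM 'exports/data.csv'
WITH (DATA_SOURCE = 'blob_ds', FORMAT = 'CSV', FIRSTROW = 2);

Wrapping the BULK INSERT in a parameterized stored procedure, as described above, lets ADF fan it out over many files in parallel.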
