I am designing a data quality visualization dashboard that will show the data quality score of a particular table/subject area across different dimensions. I have two options for building this visualization: a Databricks dashboard or Tableau. I wanted to know whether there will be any limitations if we use a Databricks dashboard instead of Tableau.
The data quality dashboard will be used by both external and internal users.
Azure Synapse Analytics is the data warehouse solution from Azure.
There are three ways to load data into the warehouse:
COPY statement
PolyBase
Bulk insert
The fastest and most scalable ways to load data are the COPY statement and PolyBase.
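For illustration, here is a minimal sketch of issuing a COPY statement from Python via pyodbc; the server, credentials, table, and storage path are all placeholders, not values from an actual setup:

```python
import pyodbc

# Hypothetical connection details for a dedicated SQL pool.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=myworkspace.sql.azuresynapse.net;"
    "Database=mydwh;UID=loader;PWD=<password>;"
)

# COPY INTO ingests files straight from storage into a dedicated SQL pool table.
copy_sql = """
COPY INTO dbo.Sales
FROM 'https://myaccount.blob.core.windows.net/landing/sales/*.parquet'
WITH (
    FILE_TYPE = 'PARQUET',
    CREDENTIAL = (IDENTITY = 'Managed Identity')
)
"""

cur = conn.cursor()
cur.execute(copy_sql)
conn.commit()
```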
However, it is now also possible to load data through Synapse Link, which enables near-real-time data.
But I do not see any documentation referring to Synapse Link being used in a traditional data warehouse for analytics.
The use cases in the documentation are:
Supply chain analytics, forecasting & reporting
Real-time personalization
Predictive maintenance, anomaly detection in IoT scenarios
These are use cases that need real-time data.
I do not need near-real-time data. Therefore I assume "Synapse Link" has some disadvantages for a traditional data warehouse solution.
Could someone please share their knowledge about using "Synapse Link" in a traditional analytics data warehouse?
Thanks in advance
With "Traditional datawarehouse solution" I assume you have ETL processes, that load/refresh your DWH say once a day.
The Synapse Link is a very convenient way to import Cosmos DB or Dataverse Data into a Data Lake connected to Synapse. The "real time" part of it shouldn't bother you, because you can always use batch jobs (dataflows) to load the data periodically from the lake into your datawarehouse.
With the Synapse Link you save time and development effort to bring the data properly from the Cosmos DB or Dataverse into your analytical environment. It works great for us.
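To make the batch pattern concrete, here is a sketch assuming a Synapse Spark notebook (where `spark` is predefined); the linked service, container, and lake path names are placeholders:

```python
# Read the Cosmos DB analytical store exposed through Synapse Link.
df = (
    spark.read
    .format("cosmos.olap")
    .option("spark.synapse.linkedService", "CosmosDbLinkedService")  # placeholder
    .option("spark.cosmos.container", "Orders")                      # placeholder
    .load()
)

# Persist a periodic snapshot to the lake; a downstream job (or data flow)
# can then load it into the warehouse on the usual daily schedule.
(
    df.write
    .mode("overwrite")
    .parquet("abfss://curated@mydatalake.dfs.core.windows.net/orders/daily/")
)
```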
I have to create charts and metrics in an Azure Monitor overview dashboard based on Azure SQL database data. How can we do that? I don't want to show database performance; I want to display the database's data in the dashboard.
Azure Monitor helps you maximize the availability and performance of your applications and services by monitoring things like CPU performance, memory usage, and storage usage. It can't be used to visualize the data stored in an Azure SQL database or any other service.
To visualize the data, you should use tools like Azure Data Explorer or Power BI.
Please check Visualize data with Azure Data Explorer dashboards.
We are in the process of analyzing which database will be the best choice for time series data (like stock market data, trading data, market sentiment, etc.).
Is Azure Synapse a good choice for time series data?
Azure Synapse data explorer (Preview) provides you with a dedicated query engine optimized and built for log and time series data workloads.
With this new capability now part of Azure Synapse's unified analytics platform, you can easily access your machine and user data to surface insights that can directly improve business decisions.
To complement the existing SQL and Apache Spark analytical runtimes, Azure Synapse data explorer is optimized for efficient log analytics, using powerful indexing technology to automatically index structured, semi-structured, and free-text data commonly found in telemetry data.
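As a rough illustration of what querying it looks like from Python (the cluster URI, database, and table names are hypothetical; the KQL shows a typical bin()-based time series aggregation):

```python
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

# Hypothetical Data Explorer pool URI; authentication method may vary.
cluster = "https://mypool.mysynapse.kusto.azuresynapse.net"
kcsb = KustoConnectionStringBuilder.with_aad_device_authentication(cluster)
client = KustoClient(kcsb)

# Hourly average price per symbol over the last week.
query = """
StockTicks
| where Timestamp > ago(7d)
| summarize AvgPrice = avg(Price) by Symbol, bin(Timestamp, 1h)
| order by Symbol asc, Timestamp asc
"""

response = client.execute("MarketData", query)  # database name is an assumption
for row in response.primary_results[0]:
    print(row["Symbol"], row["Timestamp"], row["AvgPrice"])
```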
For more info, please refer to the related articles below:
https://learn.microsoft.com/en-us/azure/synapse-analytics/data-explorer/data-explorer-overview
Time series solution - Azure Architecture
Please note that the feature is in public preview.
We have on-prem SQL Server Analysis Services (SSAS) Multidimensional with many complex custom calculations, many measure groups, and a complex model with many more features. We process a few billion rows per day and have a custom Excel add-in for custom pivots, as well as the standard pivot table functionality used to create reports, run ad-hoc queries, etc.
Below are the possible solutions in Azure:
Approach 1: Azure Synapse, SSAS Multidimensional (ROLAP), Excel, and Power BI. Note that SSAS Multidimensional will run as IaaS, hosted in a VM. Desktop Excel/Excel 365 and cloud Power BI will be able to connect.
Approach 2: Azure Synapse, an Azure Analysis Services Tabular model in DirectQuery mode, Excel, and Power BI. Desktop Excel/Excel 365 and cloud Power BI will be able to connect.
Questions:
Which approach will be better, given the huge data volume, processing, complex logic, maintenance, and custom calculations?
Can users access these cloud-based cubes, especially SSAS Multidimensional, via their desktop Excel or via Excel 365?
How will ROLAP performance compare with DAX in DirectQuery mode?
What will be the cost of moving and processing fairly large amounts of data?
With 12 TB of data you will probably be looking at 500-1200 GB of compressed Tabular model size, unless you can reduce the model size by not keeping all of history, pruning unused rows from dimensions, and skipping unnecessary columns. That's extremely large even for a Tabular model that's only processed weekly. So I agree an import model wouldn't be practical.
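For what it's worth, the arithmetic behind that range amounts to assuming roughly 10x-24x compression on 12 TB of source data (the ratios are inferred from the numbers above, not a stated fact):

```python
raw_gb = 12 * 1024  # 12 TB of source data, in GB
for ratio in (10, 24):
    print(f"{ratio}x compression -> ~{raw_gb / ratio:,.0f} GB")
# 10x -> ~1,229 GB; 24x -> ~512 GB
```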
My recommendation would be a Tabular model. A ROLAP Multidimensional model still requires MOLAP dimensions to perform decently and your dimension sizes and refresh frequency will make that impractical.
So a Tabular model in Azure Analysis Services in DirectQuery mode should work. If you optimize Synapse well, you should hopefully get query response times in the 10-60 second range; if you do an amazing job, you could probably get it even faster. But performance will be largely dependent on Synapse, so materialized views, enabling the query result set cache, ensuring proper distributions, and ensuring good-quality columnstore compression will be important. If you aren't an expert in Synapse and Azure Analysis Services, find someone who is to help.
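As a sketch of that Synapse-side tuning (database, table, and column names are placeholders; note that RESULT_SET_CACHING has to be set from the master database):

```python
import pyodbc

# Placeholder connection string template for a dedicated SQL pool.
CONN_STR = (
    "Driver={{ODBC Driver 18 for SQL Server}};"
    "Server=myworkspace.sql.azuresynapse.net;"
    "Database={db};UID=admin;PWD=<password>;"
)

# Enable the query result set cache (must be run from master).
master = pyodbc.connect(CONN_STR.format(db="master"), autocommit=True)
master.cursor().execute("ALTER DATABASE mydwh SET RESULT_SET_CACHING ON")

# Pre-aggregate a hot query path as a materialized view; Synapse requires
# COUNT_BIG(*) alongside the other aggregates.
dwh = pyodbc.connect(CONN_STR.format(db="mydwh"), autocommit=True)
dwh.cursor().execute("""
CREATE MATERIALIZED VIEW dbo.mvSalesByProduct
WITH (DISTRIBUTION = HASH(ProductKey))
AS
SELECT ProductKey, SUM(SalesAmount) AS TotalSales, COUNT_BIG(*) AS RowCnt
FROM dbo.FactSales
GROUP BY ProductKey
""")
```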
In Azure Analysis Services, ensure you mark relationships to enforce referential integrity, which changes the generated SQL queries to inner joins and helps performance. And keep the model and the calculations as simple as possible, since your model is so large.
Another alternative if you want very snappy interactive dashboard performance for previously anticipated visualizations would be to use Power BI Premium instead of Azure Analysis Services and to do composite models. That allows you to create some smaller agg tables which are imported and respond fast to queries at an anticipated grain. But then other queries will “miss” aggs and run SQL queries against Synapse. Phil Seamark describes aggregations in Power BI well.
I am evaluating how to implement a data governance solution with Azure Data Catalog for a Data Lake batch transformation pipeline. Below is my approach. Any insights, please?
Data Factory can't capture the lineage from source to Data Lake.
I know Data Catalog cannot maintain business rules for data curation on the Data Lake.
First, the data feed is onboarded manually in Azure Data Catalog under a given business glossary, etc. Alternatively, when a raw data feed is ingested into Data Lake Storage, the asset is created automatically under the given business glossary (if it does not exist).
The raw data is cleaned, classified, and tagged during a light transformation on the lake. Thus, related tags need to be created in Data Catalog (this is custom coding calling the Azure Data Catalog REST APIs; see the sketch after these steps).
Then there is ETL processing; new data assets are created with tagging in Data Catalog. The tools are Spark-based (again, custom coding calling the Azure Data Catalog REST APIs). Finally, Data Catalog will display all data assets created in the Data Lake batch transformation pipeline under the specific business glossary with the right tags.
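For the tagging steps, here is a rough sketch of calling the Azure Data Catalog REST API with Python's requests library. The catalog name, token, asset address, and tag values are placeholders, and the payload only approximates the documented register-asset shape, so verify it against the API reference before relying on it:

```python
import requests

CATALOG = "DefaultCatalog"                      # assumption
TOKEN = "<AAD bearer token for Data Catalog>"   # e.g. obtained via MSAL
HEADERS = {"Authorization": f"Bearer {TOKEN}", "Content-Type": "application/json"}

# Register (or update) a curated asset with a tag annotation.
payload = {
    "properties": {
        "fromSourceSystem": False,
        "name": "curated_orders",                                   # placeholder
        "dataSource": {"sourceType": "Azure Data Lake Store",
                       "objectType": "Table"},
        "dsl": {
            "protocol": "webhdfs",
            "authentication": "oauth",
            "address": {
                "url": "https://mylake.azuredatalakestore.net/curated/orders"
            },                                                      # placeholder
        },
        "lastRegisteredBy": {"upn": "pipeline@contoso.com"},        # placeholder
    },
    "annotations": {
        "tags": [{"properties": {"tag": "cleaned",
                                 "fromSourceSystem": False}}]
    },
}

resp = requests.post(
    f"https://api.azuredatacatalog.com/catalogs/{CATALOG}"
    "/views/tables?api-version=2016-03-30",
    headers=HEADERS, json=payload,
)
resp.raise_for_status()
# The asset URI is typically returned in the Location header.
print("Asset:", resp.headers.get("Location"))
```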
I am skipping operational metadata and full lineage, as there is no such solution in the Azure offerings; this needs to be a custom solution again.
I am looking for the best practice. Appreciate your thoughts.
Many thanks
Cengiz