I am currently creating a data warehouse in Azure Synapse; however, Synapse does not allow the creation of foreign keys. This is vital for referential integrity between the fact and dimension tables. Does anyone have any suggestions as to what the alternatives are in Synapse for enforcing a PK/FK relationship?
I searched this topic and found that the focus of Synapse is performance rather than integrity enforcement. We can create primary keys and structure the star schema with fact and dimension tables, and code the joins between them.
It confused me too, until I worked through this tutorial and read it carefully:
Load Contoso retail data to Synapse SQL
In a star schema any referential integrity should be enforced within the ETL tool used to load the data and not in the DB itself.
Some DBs support logical FKs that can help with query execution plans, but they should never be physicalised.
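For instance, the ETL can validate fact keys against the dimension before the load is committed; a rough sketch with hypothetical staging and dimension table names:

    -- Flags fact rows whose CustomerKey has no matching dimension row,
    -- so the load can be failed or the rows quarantined.
    SELECT f.CustomerKey, COUNT(*) AS orphan_rows
    FROM staging.FactSales AS f
    LEFT JOIN dbo.DimCustomer AS d
        ON d.CustomerKey = f.CustomerKey
    WHERE d.CustomerKey IS NULL
    GROUP BY f.CustomerKey;

If this query returns any rows, the pipeline can stop or divert them before they ever reach the fact table.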
I'm new to Azure Synapse Analytics and I'm trying to create internal tables to build my DW.
I have just found out that we cannot use primary key/foreign key.
My question is, how do I relate my fact tables to my dimension tables? Is it necessary to create any kind of relationship between them, or just create them plainly and use joins whenever needed?
I suggest creating your tables with the primary key identified; it helps the optimiser make better query plan choices.
There is currently no syntax supported for foreign key identification.
As you suggested, just go ahead and write your joins.
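For example, a minimal sketch of a dimension table in a dedicated SQL pool with the primary key declared (table and column names are illustrative; NONCLUSTERED and NOT ENFORCED are required because the engine does not enforce the constraint):

    CREATE TABLE dbo.DimCustomer
    (
        CustomerKey  INT           NOT NULL PRIMARY KEY NONCLUSTERED NOT ENFORCED,
        CustomerName NVARCHAR(100) NOT NULL
    )
    WITH
    (
        DISTRIBUTION = REPLICATE,
        CLUSTERED COLUMNSTORE INDEX
    );

There is no FOREIGN KEY clause for the fact table; the join columns are simply declared as ordinary NOT NULL columns and used in your joins.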
I'm currently working in an Azure Synapse DWH and I have some theoretical questions:
How can I create relationships between tables (dims and facts), and what implications would there be if I create those relationships?
I read that to create a primary key I would need to make it nonclustered, but what does that mean?
Azure Synapse Analytics (ASA) has three engines:
serverless SQL pools (was SQL on-demand)
dedicated SQL pools (the next step on from Azure SQL Data Warehouse)
Apache Spark pools
None of these currently support database relationships, as of today. I suspect you mean dedicated SQL pools, and just to confirm, it does not support the FOREIGN KEY syntax. Relationships are more of an OLTP concept and are not common in big data platforms, which ASA is.
Therefore your options are to enforce these relationships downstream or on import to your warehouse. A common method is to identify unknown values and substitute them with a -1 / Unknown value on import. This will ensure there are no NULLs in your key columns.
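A minimal sketch of that substitution during the load, assuming hypothetical staging and dimension tables (the -1 / "Unknown" member row must already exist in the dimension):

    -- Any incoming row whose key has no match in the dimension
    -- gets the -1 surrogate key instead of a NULL.
    INSERT INTO dbo.FactSales (CustomerKey, SalesAmount)
    SELECT ISNULL(d.CustomerKey, -1) AS CustomerKey,
           s.SalesAmount
    FROM staging.Sales AS s
    LEFT JOIN dbo.DimCustomer AS d
        ON d.CustomerSourceId = s.CustomerId;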
Additionally, enforce your relationships downstream, e.g. in an Azure Analysis Services tabular model or Power BI model.
If you really need relationships then depending on your data volumes you might consider Azure SQL Database which supports data volumes up to 4TB alongside columnstore indexes which give great compression.
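If you go that route, a hedged sketch of what it can look like in Azure SQL Database, where enforced foreign keys and columnstore indexes can coexist (names are illustrative):

    CREATE TABLE dbo.DimCustomer
    (
        CustomerKey  INT NOT NULL CONSTRAINT PK_DimCustomer PRIMARY KEY NONCLUSTERED,
        CustomerName NVARCHAR(100) NOT NULL
    );

    CREATE TABLE dbo.FactSales
    (
        SalesId     BIGINT IDENTITY NOT NULL,
        CustomerKey INT NOT NULL
            CONSTRAINT FK_FactSales_DimCustomer REFERENCES dbo.DimCustomer (CustomerKey),
        SalesAmount DECIMAL(18,2) NOT NULL,
        INDEX CCI_FactSales CLUSTERED COLUMNSTORE
    );

Here the foreign key really is enforced at insert time, which Synapse dedicated SQL pools cannot do.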
Having a similar issue:
I cannot find an automated solution thus far.
I'm importing 'entities' from D365 to the data lake, and they do NOT come with the relationships.
It will also NOT suggest the "Related Tables".
One approach: introduce ETL of the 'entities' using T-SQL and Spark.
This means governance of PySpark, notebooks, schemas, T-SQL linting, orchestration of activities and pipelines, workflows, etc.
OR
For small datasets and projects:
1. Reverse look-up each table needed.
2. In Azure Synapse, create a new Dataflow and download the .PBIX.
3. Do your ETL: create the primary fact and dimension tables (by whatever means, such as using a PowerPivot unique/distinct DAX expression on a Customer table).
4. Once complete, if you like, import the newly ETL'd primary tables to the data lake.
5. Repeat step 2.
6. Create the relationships with Power BI. (Ideally, if the ETL is done correctly, PBI will auto-find the relationships.)
7. Re-publish the .PBIX with the relationships as a "Dataflow".
   a. You must create relationships for every Dataflow; dataflows cannot be combined.
Measures and Dataflows will consume resources and require performance analysis if they grow.
At some point Dataverse may make D365 data available, making this easier.
Depending on your cost/spend, cloning all of D365 still doesn't solve your relationship needs.
Two solutions I'm aware of thus far:
1. Import the serverless DBOs to Power BI, then model and create the dataset there. You can do massive ETL, including foreign key creation and filtering of NULL values to create primary keys for dimensions, aggregate data and create fact tables, etc.
It's far easier than using the Synapse GUI. The drawbacks are PBI licensing related.
2. Create a "Lake Database" (map as you go; great for 5 or fewer entities/tables). The ETL is low-code. But I'm skeptical: after 40 hours of training, I should have just learned how to script this in a workbook/Spark.
Or do BOTH: use Power BI to develop your model and test it, then go back to Synapse and deploy the working model as a pipeline or lake database.
Points of clarity from the top-posted solution:
Do not trust the auto-relationships of Power BI; stay away from pre-made REFID relationships in PBI unless you know for sure this is what you want. (Step 6, original poster: if the ETL is correct, it's a 1:M.)
Publishing with .PBIX has its limitations with sharing and other issues the OP mentioned. A Lake Database might be the workaround if you have Tableau, Python, or Qlik as your solution.
Dataverse is coming, and PBI analytics as well as predictive analysis with HDInsight will be embedded into D365. You will also be able to create drag-and-drop dashboards. As of 08-05-2022 this is already working in its infancy; even though they want you to go modular, with a hybrid serverless setup you can STILL pull the aggregate measures from D365 into Synapse and reverse engineer them.
According to the documentation "Primary key, foreign key, and unique key in Synapse SQL pool", Azure Synapse doesn't support foreign keys.
Thus, how can I model the star schema in Azure Synapse? Or is it not necessary on Azure Synapse?
As per the documentation you have linked:
Having primary key and/or unique key allows Synapse SQL pool engine to generate an optimal execution plan for a query.
After creating a table with primary key or unique constraint in Synapse SQL pool, users need to make sure all values in those columns are unique. A violation of that may cause the query to return inaccurate result.
The takeaway from the above is that Synapse SQL pool does not enforce constraints. PK and unique "constraints" are purely for query performance optimisation and do not actually enforce uniqueness.
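A small illustration of that behaviour with hypothetical names: the NOT ENFORCED key is accepted at definition time, but duplicate inserts still succeed, so deduplication has to happen in your loading process.

    CREATE TABLE dbo.DimProduct
    (
        ProductKey  INT NOT NULL PRIMARY KEY NONCLUSTERED NOT ENFORCED,
        ProductName NVARCHAR(100)
    );

    INSERT INTO dbo.DimProduct VALUES (1, 'Widget');
    INSERT INTO dbo.DimProduct VALUES (1, 'Widget duplicate'); -- no error is raised

Queries that assume ProductKey is unique can now silently return inaccurate results, as the documentation warns.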
I have an Azure SQL database with many tables that I want to update frequently with any change made, be it an update or an insert, using Azure Data Factory v2.
There is a section in the documentation that explains how to do this.
However, the example is about two tables, and for each table a TYPE needs to be defined, and for each table a Stored Procedure is built.
I don't know how to generalize this for a large number of tables.
Any suggestion would be welcome.
You can follow my answer https://stackoverflow.com/a/69896947/7392069. I don't know how to generalise the creation of table types and stored procedures either, but at least the metadata table of the metadata-driven copy task provides a lot of comfort in achieving what you need.
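For reference, the per-table pattern the documentation describes looks roughly like the sketch below for one hypothetical table (dbo.Customer); you would still need to repeat or script out the type and procedure for each table. ADF passes the changed rows into the table type parameter and the procedure upserts them:

    CREATE TYPE dbo.CustomerType AS TABLE
    (
        CustomerId INT NOT NULL,
        Name       NVARCHAR(100),
        ModifiedOn DATETIME2
    );
    GO

    CREATE PROCEDURE dbo.spUpsertCustomer
        @Customer dbo.CustomerType READONLY
    AS
    BEGIN
        MERGE dbo.Customer AS target
        USING @Customer AS source
            ON target.CustomerId = source.CustomerId
        WHEN MATCHED THEN
            UPDATE SET target.Name = source.Name,
                       target.ModifiedOn = source.ModifiedOn
        WHEN NOT MATCHED THEN
            INSERT (CustomerId, Name, ModifiedOn)
            VALUES (source.CustomerId, source.Name, source.ModifiedOn);
    END
    GO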
I'm trying to find the best solution for storing dynamic spatial data. I wonder if any of Microsoft's Azure solutions could work. Azure Table Storage would let me create a lot of custom and dynamic structures stored on fast SSD disks.
Because of the data's dynamic nature, common indexing seems useless. I would also like to create a lot of table-like structures, so the whole architecture cannot be static. Using Azure Table Storage I would dynamically create tables based on country, city, etc., sorted by latitude or longitude.
I would appreciate any clue.
Azure Table Storage has mostly been replaced by Azure Cosmos DB.
At the time of writing the Table Storage page even says:
The content in this article applies to the original basic Azure Table storage. However, there is now a premium offering for Azure Table storage in public preview that offers throughput-optimized tables, global distribution, and automatic secondary indexes. To learn more and try out the new premium experience, please check out Azure Cosmos DB: Table API.
You can use Cosmos DB via the Table API, but you'll probably find the Document DB API to be more powerful.
Documents are "schema-free". You can just throw your documents in to a collection, and then you can query against them.
You can create documents which have geo-spatial properties which are indexed automatically.
Then you can perform geo-spatial queries against those properties.
For example you might give each of your documents a point, and then create a query to select all documents that are inside of a polygon.
Or maybe you want to find out how far away each document is from a given point.
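Hedged sketches of both queries against a hypothetical, automatically indexed location property (Cosmos DB SQL API; the coordinates are illustrative, and polygon rings must be closed and wound counterclockwise):

    -- Documents whose point lies inside a polygon:
    SELECT c.id
    FROM c
    WHERE ST_WITHIN(c.location, {
        "type": "Polygon",
        "coordinates": [[
            [-122.36, 47.61], [-122.31, 47.61],
            [-122.31, 47.65], [-122.36, 47.65],
            [-122.36, 47.61]
        ]]
    })

    -- Documents within 10 km (ST_DISTANCE returns meters) of a given point:
    SELECT c.id
    FROM c
    WHERE ST_DISTANCE(c.location, {
        "type": "Point",
        "coordinates": [-122.33, 47.61]
    }) < 10000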