Databricks Unity Catalog: multiple metastores for the same region

We have three Databricks workspaces: one for dev, one for test, and one for production. All of these workspaces are in the same region, West Europe.
All of our data lives in the data lake, meaning external tables in Databricks reference data in the data lake (Azure Data Lake Storage Gen2).
Each of these workspaces therefore has a different data lake associated with it (since they serve different environments).
This does not fit the usual Unity Catalog use case, where multiple workspaces refer to the same metastore: each environment has different access requirements and different data, and in some cases certain tables exist in lower environments but not in production.
Also, looking at the documentation, I see the following sentence:
You can create one metastore per region and attach it to any number of workspaces in that region.
All of our Databricks workspaces (for the different environments) are in the same region, but in different subscriptions.
Does Unity Catalog simply not apply to this use case? That would mean creating three different metastores for the same region.
If not, then how can we get features like:
Terraform capabilities that are only available for Unity Catalog, e.g. creating schemas.
Data lineage

This is how Unity Catalog works (at least right now): each region may have only one Unity Catalog metastore, and all workspaces in that region can be attached to it.
Right now the problem of environment separation can be solved with user groups (see the sketch after this answer), and you can configure the Azure Storage firewall so that each environment's storage accounts only accept traffic from that environment's workspaces.
Later this year there will be a feature that allows attaching specific catalogs only to specific workspaces, so you can cleanly separate environments. It was mentioned in last quarter's product roadmap, and you can attend the upcoming product roadmap webinar for more updates on Unity Catalog.
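As an illustration of the group-based approach, here is a minimal sketch of Unity Catalog grants, run from a notebook in an attached workspace; the catalog names (dev, prod) and group names are hypothetical, not anything prescribed by Unity Catalog itself:

```python
# Sketch: environment separation via grants to environment-specific
# account groups. Catalog and group names below are examples only.
grants = {
    "dev": "dev-engineers",    # group allowed into the dev catalog
    "prod": "prod-engineers",  # group allowed into the prod catalog
}

for catalog, group in grants.items():
    # USE CATALOG / USE SCHEMA gate visibility; SELECT gates reads.
    spark.sql(
        f"GRANT USE CATALOG, USE SCHEMA, SELECT "
        f"ON CATALOG {catalog} TO `{group}`"
    )
```

Users who are not in the granted group simply won't see, or be able to query, the other environment's catalog.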

Related

Using ksqldb to join data from multiple types of source connectors

We are evaluating ksqlDB as an ETL tool for our organization. Our entire app is hosted on Microsoft Azure, and PaaS offerings are generally preferred in our organization. One use case is that we have multiple microservices with their own databases, and we want to join tables across those databases to produce data in a denormalized format for other tasks. An example would be a Users table containing user data while an Orders table contains all the orders; Users may live in MySQL while Orders may live in MongoDB. We need to generate a report by joining the Orders and Users tables on user_id. This can be done in ksqlDB by adding a source connector for each database and joining the resulting streams/tables. We can then write a sink connector to a new MongoDB database that holds the joined Users_Orders data, so that while the connectors and joins are running, newly added data keeps Users_Orders up to date.
I read that using ksqlDB in production with Azure Event Hubs will not be possible because of licensing issues. So my question is:
Before moving to other products like Azure HDInsight or Confluent Cloud, is there any way of running ksqlDB to achieve the same solution (perhaps by managing your own Kafka cluster)?
You don't necessarily need ksqlDB; you should be able to do something similar with Spark SQL, offered in Azure via Databricks. You don't necessarily need Kafka / Event Hubs either, since Spark can read, join, and write Mongo/JDBC data all on its own (with the appropriate connectors), as sketched below.
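For illustration, a minimal batch sketch of that Spark-only approach; the hosts, credentials, database, and collection names are placeholders, and it assumes the MySQL JDBC driver and the MongoDB Spark connector (v10+) are available on the cluster:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("users-orders-join").getOrCreate()

# Users from MySQL over JDBC (connection details are placeholders).
users = (spark.read.format("jdbc")
         .option("url", "jdbc:mysql://mysql-host:3306/appdb")
         .option("dbtable", "users")
         .option("user", "reader")
         .option("password", "secret")
         .load())

# Orders from MongoDB (assumes the MongoDB Spark connector v10+).
orders = (spark.read.format("mongodb")
          .option("connection.uri", "mongodb://mongo-host:27017")
          .option("database", "appdb")
          .option("collection", "orders")
          .load())

# Denormalize: one row per order, enriched with the owning user's data.
users_orders = orders.join(users, on="user_id", how="inner")

# Sink the joined result to a new MongoDB collection.
(users_orders.write.format("mongodb")
 .option("connection.uri", "mongodb://mongo-host:27017")
 .option("database", "reports")
 .option("collection", "users_orders")
 .mode("append")
 .save())
```

This batch job re-reads both sources on each run; Spark Structured Streaming could keep the output continuously updated, which is closer to what the ksqlDB joins would do.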
The main reason ksqlDB isn't offered as a hosted service by Azure is that doing so would conflict with Confluent's licensing, but that doesn't prevent you from running it yourself, as long as you adhere to the licensing restriction of not offering the ksqlDB REST API as a publicly available / paid API. I haven't personally tried it, but ksqlDB should work against Event Hubs on its own; I don't think you need to self-manage Kafka as the documentation suggests.

Azure Synapse Workspace Disaster Recovery considerations

We are working through some DR scenarios. Is the management plane / metadata for a Synapse workspace geo-redundant? E.g. the workspace definition, the list of notebooks, the currently deployed pipelines, monitoring run history, etc.? I appreciate that the data has backups for the deployed region, but I am interested in the actual workspace metadata/definition.
The scenario we are interested in: if a local Azure data center ceases to operate, say due to a natural disaster, will the Synapse workspace continue to operate and show the deployed notebooks, pipelines, run history of previous jobs, etc.? From the documentation it seems like the data will need to be restored and possibly linked services reconnected, but I can't find anything on the workspace metadata / management plane.
If anyone could point me to docs or add clarity, that would be great.

Data Migration from Snowflake (on GCP Instance) to Snowflake (Azure Instance)

I am looking for some input on how to do a GCP-to-Azure cloud data migration.
Scenario:
I have a Snowflake account on GCP (multiple databases holding legacy data) and another Snowflake account on Azure (the DWH was created on this account).
I want to move/copy the data of all the databases (including all child objects: schemas, tables, views, etc.) from the GCP Snowflake account to the Snowflake account on Azure.
Can you please guide me on the best solution for such a data migration? Any steps or documentation links would be really helpful.
Many thanks - Minti
Please check the database replication mechanism, which can be used as a migration tool for moving a Snowflake account from one cloud platform to another: https://docs.snowflake.com/en/user-guide/database-replication-intro.html
Not something I've done before, to be honest, but if you didn't want to use external tools, one possible method would be to secure-share your GCP databases with your Azure Snowflake account.
You then might be able to create a new database that is a clone of this share (not sure if this is possible).
Most objects get cloned apart from stages and pipes, but tables, views, etc. should carry over.
This is a pretty easy process with a couple of prerequisites.
1. Make sure you have Organizations enabled on your GCP account. This feature allows you to self-provision Snowflake accounts on any cloud provider/region; open a support case to enable it. See: Introduction to Organizations.
2. Create a new account on Azure if you haven't already.
3. Enable replication on both accounts. This can be done while logged into the account with the ORGADMIN role.
4. Replicate your databases (a minimal sketch of the SQL follows below).
Note: this will give you a replica, in your Azure Snowflake account, of the databases in the GCP Snowflake account. If you want to permanently migrate your databases, you need to set up Failover/Failback. This is a Business Critical feature, but Snowflake Support will enable it for lower editions until you can complete your migration, at which point they will disable it.
Replicating a Database to Another Account
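For reference, a minimal sketch of the SQL involved in step 4, shown here via the Snowflake Python connector; the organization name (myorg), account names (gcp_account, azure_account), database name, and credentials are placeholders:

```python
import snowflake.connector

# On the source (GCP) account: allow the database to replicate to Azure.
src = snowflake.connector.connect(
    account="myorg-gcp_account", user="admin", password="...",
    role="ACCOUNTADMIN",
)
src.cursor().execute(
    "ALTER DATABASE legacy_db ENABLE REPLICATION "
    "TO ACCOUNTS myorg.azure_account"
)

# On the target (Azure) account: create the replica and refresh it.
tgt = snowflake.connector.connect(
    account="myorg-azure_account", user="admin", password="...",
    role="ACCOUNTADMIN",
)
cur = tgt.cursor()
cur.execute(
    "CREATE DATABASE legacy_db AS REPLICA OF myorg.gcp_account.legacy_db"
)
cur.execute("ALTER DATABASE legacy_db REFRESH")  # pull the initial snapshot
```

The REFRESH can be re-run (or scheduled with a task) to pick up changes until you cut over.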
There are two options.
1. You could make use of the replication feature. High-level steps include the below:
a. Create the target account. You can use the Organizations feature available in Snowflake (enabled by Snowflake Support upon request).
b. Account-level objects should be created manually in the target account.
Note: the Failover feature is supported for accounts on the Business Critical edition and above. However, for account migration scenarios, this feature will be enabled for a temporary period by Snowflake Support.
c. Replication: the links below can be referenced for a complete understanding of the process.
https://docs.snowflake.com/en/user-guide/database-replication-intro.html#introduction-to-database-replication-across-multiple-accounts
https://docs.snowflake.com/en/user-guide/database-replication-config.html#replicating-a-database-to-another-account
https://docs.snowflake.com/en/user-guide/database-failover-config.html#failing-over-databases-across-multiple-accounts
Please see the link below for an overview of the associated costs:
https://docs.snowflake.com/en/user-guide/database-replication-billing.html#understanding-billing-for-database-replication
Limitations
https://docs.snowflake.com/en/user-guide/database-replication-intro.html#current-limitations-of-replication
2. The other option is to create the target account and use the data unloading and loading features (a sketch follows after the links):
https://docs.snowflake.com/en/user-guide-data-unload.html
https://docs.snowflake.com/en/user-guide-data-load.html
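A minimal sketch of that unload/load path, again via the Python connector; it assumes external stages (@gcs_unload_stage on the GCP side, @azure_load_stage on the Azure side) have already been created with appropriate credentials, and all names are placeholders:

```python
import snowflake.connector

# On the GCP account: unload a table to an external stage backed by GCS.
src = snowflake.connector.connect(account="myorg-gcp_account",
                                  user="admin", password="...")
src.cursor().execute(
    "COPY INTO @gcs_unload_stage/orders/ FROM legacy_db.public.orders "
    "FILE_FORMAT = (TYPE = PARQUET)"
)

# ...copy the files from GCS to Azure Blob Storage out of band
# (e.g. with azcopy), then load them on the Azure account:
tgt = snowflake.connector.connect(account="myorg-azure_account",
                                  user="admin", password="...")
tgt.cursor().execute(
    "COPY INTO dwh_db.public.orders FROM @azure_load_stage/orders/ "
    "FILE_FORMAT = (TYPE = PARQUET) MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE"
)
```

Unlike replication, this moves data per table, so you would loop it over every schema and table you need.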

Relationships in Azure synapse (DWH)

I'm currently working in an Azure Synapse DWH and I have some theoretical questions:
How can I create relationships between tables (dims and facts), and what implications would creating those relationships have?
I read that to create a primary key I would need to make it nonclustered, but what does that mean?
Azure Synapse Analytics (ASA) has three engines:
serverless SQL pools (formerly SQL on-demand)
dedicated SQL pools (the successor to Azure SQL Data Warehouse)
Apache Spark pools
None of these currently supports database relationships. I suspect you mean dedicated SQL pools, and just to confirm, they do not support the FOREIGN KEY syntax; a PRIMARY KEY can only be declared NONCLUSTERED NOT ENFORCED, which is what the "nonclustered" remark you read refers to. Relationships are more of an OLTP concept and are not common in big data platforms, which ASA is.
Therefore your options are to enforce these relationships downstream or on import to your warehouse. A common method is to identify unknown values and substitute them with a -1 / "Unknown" value on import, as sketched below. This ensures there are no NULLs in your key columns.
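A minimal sketch of that substitution in PySpark before loading the warehouse; the paths, column, and table names are made up for illustration:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("fact-load").getOrCreate()

# Hypothetical staging fact data with a possibly-NULL dimension key.
fact = spark.read.parquet("abfss://staging@lake.dfs.core.windows.net/sales")

# Substitute -1 for missing customer keys so every fact row joins to a
# dedicated "Unknown" member (customer_key = -1) in the dimension table.
fact_clean = fact.withColumn(
    "customer_key",
    F.coalesce(F.col("customer_key"), F.lit(-1)),
)

fact_clean.write.mode("append").saveAsTable("dw.FactSales")
```

The dimension table needs a matching -1 / "Unknown" row for this to join cleanly.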
Additionally, enforce your relationships downstream, e.g. in an Azure Analysis Services tabular model or Power BI model.
If you really need enforced relationships then, depending on your data volumes, you might consider Azure SQL Database, which supports data volumes up to 4 TB alongside columnstore indexes, which give great compression.
I'm having a similar issue: I cannot find an automated solution thus far. I'm importing 'entities' from D365 to the data lake, and the data does NOT come with the relationships; it will also NOT suggest the "Related Tables".
One option is to ETL the 'entities' using T-SQL and Spark. This introduces governance of PySpark, notebooks, schemas, linting T-SQL, orchestration of activities and pipelines, workflows, etc.
OR, for small datasets and projects:
1. Reverse look-up each table needed.
2. In Azure Synapse, create a new Dataflow and download the .PBIX.
3. Do your ETL: create primary fact and dimension tables (by whatever means, such as using a Power Pivot unique/distinct DAX expression on a Customer table).
4. Once complete, if you like, import the newly ETL'd primary tables to the data lake.
5. Repeat step 2.
6. Create the relationships with Power BI. (Ideally, if the ETL is done correctly, PBI will auto-detect the relationships.)
7. Re-publish the .PBIX with the relationships as a "Dataflow". Note that you must create relationships for every Dataflow; Dataflows cannot be combined.
Measures and Dataflows will consume resources and require performance analysis as they grow.
At some point Dataverse may expose D365 data, making this easier; and depending on your cost/spend, cloning all of D365 still doesn't solve your relationship needs.
Two solutions I'm aware of thus far:
1. Import the serverless DBOs into Power BI; model and create the dataset there. You can do massive ETL, including foreign-key creation and filtering of NULL values to create primary keys for dimensions, aggregating data to create fact tables, etc. It's far easier than using the Synapse GUI; the drawbacks are PBI-licensing related.
2. Create a "Lake Database" (map as you go; great for five or fewer entities/tables). The ETL is low-code, but I'm skeptical that after 40 hours of training I shouldn't simply have learned how to script this in a notebook with Spark.
Or do both: use Power BI to develop and test your model, then go back to Synapse and deploy the working model as a pipeline or lake database.
Points of clarity on the top-posted solution:
Do not trust Power BI's auto-detected relationships; stay away from pre-made REFID relationships in PBI unless you know for sure that is what you want. (Step 6 above: if the ETL is correct, it's a 1:M relationship.)
Publishing with the .PBIX has its limitations with sharing, plus the other issues the OP mentioned. A Lake Database might be the workaround if you have Tableau, Python, or Qlik in your stack.
Dataverse is coming, and PBI analytics as well as predictive analysis with HDInsight will be embedded into D365; you will also be able to create drag-and-drop dashboards. As of 2022-08-05 this is already working in its infancy. Even though they want you to go modular, with a hybrid serverless setup you can still pull the aggregate measures from D365 into Synapse and reverse-engineer them.

On-premises databases to Azure SQL Database with continuous sync

My requirements are as below:
Move 3 local SAP databases to 3 Azure SQL Databases.
Then sync daily transactions or data to Azure every night: if a local transaction already exists in Azure, it should be updated; if not, it should be inserted.
The local systems will not stop after moving to Azure; they will keep running for about 6 months.
Note:
Azure SQL Data Sync does not fit our needs because of its limitations: it only supports 500 tables, it can't sync tables without primary keys, and it doesn't sync views or procedures. It also increases the database size on both sides (local and Azure).
An Azure Data Factory pipeline can fulfill my requirements, but I would have to create a pipeline and procedure manually for each table (SAP has over 2,000 tables; not good for me).
We don't use Azure VMs or Managed Instance.
Can you guide me to the best solution to move and sync? I am new to Azure.
Thanks all.
Since you mentioned that ADF basically meets your needs, I will start from ADF. Actually, you don't need to create each table's pipeline manually, one by one; the creation can be done with the ADF SDK, a PowerShell script, or the REST API. Please refer to the official documentation: https://learn.microsoft.com/en-us/azure/data-factory/
So, if you can get the list of SAP table names (I found this thread: https://answers.sap.com/questions/7375575/how-to-get-all-the-table-names.html), you can loop over the list and create the pipelines in a batch; only the table name property needs to be set, as in the sketch below.
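As a hedged sketch of that batch creation with the ADF Python SDK (azure-mgmt-datafactory): the subscription, resource group, factory, and dataset names are placeholders, and it assumes two parameterized datasets ("SapTableDataset", "AzureSqlDataset") that each take a tableName parameter:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureSqlSink, CopyActivity, DatasetReference,
    PipelineResource, SapTableSource,
)

adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# In practice, this list would come from SAP (see the thread linked above).
sap_tables = ["MARA", "VBAK", "VBAP"]

# One copy pipeline per table; only the table name varies between them.
for table in sap_tables:
    copy = CopyActivity(
        name=f"copy_{table}",
        source=SapTableSource(),
        sink=AzureSqlSink(),
        inputs=[DatasetReference(type="DatasetReference",
                                 reference_name="SapTableDataset",
                                 parameters={"tableName": table})],
        outputs=[DatasetReference(type="DatasetReference",
                                  reference_name="AzureSqlDataset",
                                  parameters={"tableName": table})],
    )
    adf.pipelines.create_or_update(
        "my-resource-group", "my-data-factory",
        f"copy_{table}_pipeline",
        PipelineResource(activities=[copy]),
    )
```

For the nightly upsert requirement, the sink in the same loop could instead target a staging table plus a stored procedure, or use the Azure SQL sink's upsert write behavior, rather than plain inserts.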
