DBT on Databricks Unity Catalog - databricks

I've been considering turning on Databricks Unity Catalog in our primary (only) workspace, but I'm concerned about how this might impact our existing dbt loads with the new three-part object references.
I see from the dbt-databricks release notes that you need >= 1.1.1 to get Unity Catalog support. The snippet with it only shows setting the catalog property in the profile. I was planning on having some of the sources in separate catalogs from the dbt-generated objects.
I might even choose to put the dbt-generated objects in separate catalogs if this were available.
As turning on Unity Catalog is a one-way road in a workspace, I don't wish to wing it and see what happens.
Has anyone used dbt with Unity Catalog and used numerous catalogs in the project?
If so, are there any gotchas, and how do you specify the catalog for sources and specific models?
Regards,
Ashley

Specifying a two-part object reference in the schema property does indeed cause problems, at least with incremental models; specify the catalog property instead:
sql-serverless:
  outputs:
    dev:
      host: ***.cloud.databricks.com
      http_path: /sql/1.0/endpoints/***
      catalog: hive_metastore
      schema: tube_silver_prod
      threads: 4
      token: ***
      type: databricks
  target: dev

Thanks Anton, I ended up resolving this. I created a temporary workspace to test it before applying it to the main workspace. The catalog attribute can be applied almost anywhere you can specify the schema attribute, not just profiles.yml. I now have a dbt project that targets multiple catalogs; these are set in dbt_project.yml at the appropriate model level.
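For illustration, here is a minimal dbt_project.yml sketch of that approach, assuming (as described above) that +catalog is accepted wherever +schema is; the project, folder, and catalog names are made up:
# dbt_project.yml (sketch; names are placeholders)
models:
  my_project:
    staging:
      +catalog: raw_catalog        # models under models/staging/ build into this catalog
      +schema: staging
    marts:
      +catalog: analytics_catalog  # marts land in a separate catalog
      +schema: marts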

Related

CI / CD with Databricks Unity Catalog

I am migrating tables from hive_metastore to Unity Catalog for my Databricks workspaces.
I have three databricks workspaces:
Dev
Test
Prod
Each workspace has its own ADLSv2 storage account. (Dev, test, prod)
Currently when developing I read in a table using
df = spark.table('bronze.my_table') # schema.table
This uses the default hive_metastore which points to the corresponding container (Workspace Dev -> Storage account Dev).
However, with Unity Catalog it seems I would now have to specify the catalog too, based on which workspace I am working in, unless there is a default Unity Catalog for a workspace.
df = spark.table('dev.bronze.my_table') # catalog.schema.table
When deploying code from the Dev -> Test -> Prod workspaces, I would like to avoid having to dynamically set the catalog name for all notebooks using spark.table based on the workspace (dev, test, prod). Basically, 'bronze.my_table' should point to Delta table data stored in the dev catalog when working in Dev, and to the prod catalog when working in Prod. Is this possible? I assume I could keep the previous hive_metastore (one for each workspace) and build Unity Catalog on top of it (with the two referencing each other and kept in sync). However, isn't the idea that Unity Catalog replaces the hive_metastore?
There are a few approaches to this:
At the beginning of your program, issue the USE CATALOG catalog_name SQL command; then you can continue to use two-level naming for schema + table inside that catalog, i.e. keep using df = spark.table('bronze.my_table').
Incorporate the catalog name variable into the table name, e.g. df = spark.table(f'{catalog_name}.bronze.my_table').
In all cases you need to either explicitly pass the catalog name as a command-line option, a widget, or similar, or try to map the workspace URL to an environment.
But really, it's recommended to pass table names as configuration parameters, so you can easily switch not only between catalogs, but also between schemas/databases.
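A minimal notebook sketch of the widget approach above; the widget name "catalog" and the "dev" default are illustrative, not prescribed:
# Databricks notebook (spark and dbutils are provided by the runtime).
dbutils.widgets.text("catalog", "dev")          # pass a different value per workspace/job
catalog_name = dbutils.widgets.get("catalog")

# Option 1: set the session's default catalog, then keep two-level names.
spark.sql(f"USE CATALOG {catalog_name}")
df = spark.table("bronze.my_table")             # resolves to <catalog_name>.bronze.my_table

# Option 2: build the three-part name explicitly.
df = spark.table(f"{catalog_name}.bronze.my_table")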

How to automate my relational database operations using ci/cd and gitlab?

I want to automate my RDB. I usually use SQL Developer to compile, execute, and save my PL/SQL scripts to the database. Now I wish to build and deploy the scripts directly through GitLab, using a CI/CD pipeline. I am supposed to use Oracle Cloud for this purpose. I don't know how to achieve this; any help would be greatly appreciated.
Requirements: build and deploy PL/SQL scripts to the database using GitLab, where the password and username for the database connection are picked up from a vault on the cloud, not hardcoded. Oracle Cloud should be used for the said purpose.
If anyone knows how to achieve this, please guide.
There are tools like Liquibase and Flyway. Those tools do not work miracles.
Liquibase has a list of changes (XML or YAML) to be applied to a database schema (optionally with an undo step).
It then has a journal table in each database environment, so it can track which changes were applied and which were not.
It cannot do powerful schema comparisons the way SQL Developer or Toad does.
It also cannot prevent situations where a DML change applied to the prod database goes kaboom because the change was only tested successfully on a data set 1000x smaller.
But it is still better than nothing, and it can be integrated with Ansible/GitLab and other CI/CD tools.
You have a functional sample, using the Liquibase integration with SQLcl, in my project Oracle CI/CD demo.
To be totally honest:
It's a little out of date, because I use a trick for rollback; at the moment of writing, Liquibase tagging was not supported. Currently it is supported.
The final integration with Jenkins is not done, but it's straightforward.
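Not taken from the linked demo, but here is a rough sketch of what a GitLab CI job driving Liquibase could look like; the job name, changelog path, and the DB_URL / DB_USER / DB_PASS variables (which would come from GitLab CI/CD variables or a vault integration, never from the repo) are all assumptions:
# .gitlab-ci.yml (sketch only; names and paths are illustrative)
deploy-db:
  stage: deploy
  image: liquibase/liquibase:latest   # assumes the Oracle JDBC driver is available in the image
  script:
    # credentials and connection string injected via CI/CD variables or a vault integration
    - >
      liquibase
      --changelog-file=db/changelog/db.changelog-master.xml
      --url="$DB_URL"
      --username="$DB_USER"
      --password="$DB_PASS"
      update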

Another Azure ML bug caused by new Compute Common Runtime

Many of my Azure ML Studio Designer pipelines began failing today. I was able to make a minimum repro:
Simply excluding columns with the Select Columns In Dataset node will fail with a JobConfigurationMaxSizeExceeded error.
This appears to be a bug introduced by Microsoft's rollout of their new Compute Common Runtime.
If I go into any nodes failing with the JobConfigurationMaxSizeExceeded exception and manually set AZUREML_COMPUTE_USE_COMMON_RUNTIME:false in their Environment JSON field, then they subsequently work correctly. This is not documented anywhere that I could find; I stumbled onto this fix through trial and error and wasted many hours today trying to fix our failing pipelines.
Does anyone know where I can find a list of possible effects of the Compute Common Runtime migration in Azure ML? I could not find any documentation on this and/or how it might affect existing Azure ML pipelines.
The runtime environment variable should be set on the run configuration; the environment_variables property on the Environment object is deprecated.
https://learn.microsoft.com/en-us/python/api/azureml-core/azureml.core.runconfig.runconfiguration?view=azure-ml-py#variables
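A minimal sketch with the azureml-core SDK of setting that variable on the run configuration rather than on the Environment; the source directory and script names are placeholders, and Designer pipelines expose this through the node's Environment JSON rather than the SDK:
from azureml.core import ScriptRunConfig
from azureml.core.runconfig import RunConfiguration

run_config = RunConfiguration()
# Opt this step out of the new Compute Common Runtime.
run_config.environment_variables["AZUREML_COMPUTE_USE_COMMON_RUNTIME"] = "false"

# "src" and "train.py" are placeholder values for this sketch.
src = ScriptRunConfig(source_directory="src", script="train.py", run_config=run_config)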

how to setup catalog synchronization

I'm trying to set up catalog replication from Staged to Online. I've created the sync job, and it finishes successfully, but nothing is created in the Online catalog.
Any suggestion?
Thanks Marco
In general the SetupSyncJobService is used to create SyncJobs. Maybe it's better to use that class to create your sync job?
Did you set the root types? Root types are all the item types that should be copied from the Staged to the Online catalog. So if you want to copy products from the Staged to the Online catalog, there should be an entry "Product" in the root types list. The SetupSyncJobService creates all root types for you, so you don't need to bother. Perhaps you can compare the setup of your sync job to another sync job set up by the SetupSyncJobService and match the configuration to it.

Where is the list of deployment template schema api versions?

We are authoring Azure Resource Manager templates. We are using the following deployment template schema, because it is the one that we saw in an example.
http://schema.management.azure.com/schemas/2014-04-01-preview/deploymentTemplate.json#
It is from early 2014. Where can we find a list of more recent schema versions?
We have looked at the list of Resource Manager providers, regions, API versions, and schemas. It references a schema for each provider, not for the entire template.
When we do find a list of more recent schema, how do we evaluate which deployment template schema to use? Is more recent better?
Here is our current hack:
Go to https://github.com/Azure/azure-resource-manager-schemas
Press t to open the GitHub File Finder.
Type DeploymentTemplate.
Voilà. We have a list of deployment template schemas, which shows two API versions.
More recent is better. But in general you should be able to stick with the top level schema of:
http://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#
That will pull in the proper version of all the child schemas. We update the child schemas so all your existing templates don't have to be updated. Multiple API versions are supported in the child schemas to support "backward compat".
If you do peruse GitHub, look at the readme.md (it tells you what to test and therefore what's in use); the file you want to watch is:
https://github.com/Azure/azure-resource-manager-schemas/blob/master/schemas/2015-01-01/deploymentTemplate.json
As that's the top level schema file.
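For reference, a minimal template skeleton showing where that top-level schema reference goes (the empty sections are just placeholders):
{
  "$schema": "http://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {},
  "variables": {},
  "resources": [],
  "outputs": {}
}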
Still no official list in 2020; until we find one, here are the current root schemas for quick reference.
Resource group:
https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#
https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#
https://schema.management.azure.com/schemas/2014-04-01-preview/deploymentTemplate.json#
Subscription:
https://schema.management.azure.com/schemas/2018-05-01/subscriptionDeploymentTemplate.json#
Management Group:
https://schema.management.azure.com/schemas/2019-08-01/managementGroupDeploymentTemplate.json#
Tenant:
https://schema.management.azure.com/schemas/2019-08-01/tenantDeploymentTemplate.json#
This will definitely be outdated in the future; it is sourced from here, so be sure to check that too if you want the latest. Feel free to update the list in the future.
I was searching for the same answer and found this question.
Sorry to all those who answered before; I wasn't satisfied with the proposed solutions.
So I found another way; maybe this is suitable :)
At this page https://learn.microsoft.com/en-us/azure/templates/
you'll find on the left side a list of all types of resources that can be defined in an ARM template.
For each resource (e.g. CosmosDB) you'll find a link with All resources (e.g. https://learn.microsoft.com/en-us/azure/templates/microsoft.documentdb/allversions for CosmosDB) which lists all versions for that resource.
Hope it helps!
P.S.: there's also a Latest link (e.g. for CosmosDB, https://learn.microsoft.com/en-us/azure/templates/microsoft.documentdb/databaseaccounts), which gives the latest format of that resource ;)
