CI/CD with Databricks Unity Catalog

I am migrating tables from hive_metastore to Unity Catalog for my Databricks workspaces.
I have three databricks workspaces:
Dev
Test
Prod
Each workspace has its own ADLS Gen2 storage account (dev, test, prod).
Currently, when developing, I read a table using:
df = spark.table('bronze.my_table') # schema.table
This uses the default hive_metastore which points to the corresponding container (Workspace Dev -> Storage account Dev).
However, with Unity Catalog it seems I would now have to specify the catalog too, based on which workspace I work in, unless there is a default Unity Catalog for a workspace.
df = spark.table('dev.bronze.my_table') # catalog.schema.table
When deploying code from the Dev to the Test to the Prod workspace, I would like to avoid having to dynamically set the catalog name in every notebook that uses spark.table based on the workspace (dev, test, prod). Basically, 'bronze.my_table' should point to Delta table data stored in the dev catalog when working in Dev, and to Delta table data stored in the prod catalog when working in Prod. Is this possible? I assume I could keep the previous hive_metastore (one per workspace) and build Unity Catalog on top of it (they reference each other and stay in sync). However, isn't the idea that Unity Catalog replaces the hive_metastore?

There are a few approaches to this:
At the beginning of your program, issue the USE CATALOG catalog_name SQL command; after that you can keep using two-level naming (schema.table) inside that catalog, i.e. continue to use df = spark.table('bronze.my_table').
Incorporate the catalog name variable into the table name, e.g. df = spark.table(f'{catalog_name}.bronze.my_table').
In both cases you need to either pass the catalog name explicitly (as a command-line option, a notebook widget, or something like that), or map the workspace URL to the environment (see the sketch below).
But really, it's recommended to pass table names as configuration parameters, so you can easily switch not only between catalogs, but also between schemas/databases.
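For illustration, here is a minimal sketch of both options combined with a workspace-URL mapping; the workspace URLs and catalog names are placeholders you would replace with your own:
# Hypothetical mapping from Databricks workspace URL to Unity Catalog name
workspace_to_catalog = {
    'adb-1111111111111111.1.azuredatabricks.net': 'dev',
    'adb-2222222222222222.2.azuredatabricks.net': 'test',
    'adb-3333333333333333.3.azuredatabricks.net': 'prod',
}

# The workspace URL is exposed as a Spark conf on Databricks clusters
workspace_url = spark.conf.get('spark.databricks.workspaceUrl')
catalog_name = workspace_to_catalog[workspace_url]

# Option 1: set the default catalog once, then keep two-level names
spark.sql(f'USE CATALOG {catalog_name}')
df = spark.table('bronze.my_table')

# Option 2: build the three-level name explicitly
df = spark.table(f'{catalog_name}.bronze.my_table')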

Related

DBT on Databricks Unity Catalog

I've been considering turning on Databricks Unity Catalog in our primary (only) workspace, but I'm concerned about how this might impact our existing dbt loads with the new three-part object references.
I see from the dbt-databricks release notes that you need >= 1.1.1 to get Unity Catalog support. The snippet there only shows setting the catalog property in the profile. I was planning on having some of the sources in catalogs separate from the dbt-generated objects.
I might even choose to have the dbt-generated objects in separate catalogs if this were available.
As turning on Unity Catalog is a one-way road in a workspace, I don't wish to wing it and see what happens.
Has anyone used dbt with Unity Catalog and used numerous catalogs in the project?
If so, are there any gotchas, and how do you specify the catalog for sources and specific models?
Regards,
Ashley
Specifying a two-part object in the schema attribute indeed causes problems, at least in incremental models; instead, specify the catalog:
sql-serverless:
  outputs:
    dev:
      host: ***.cloud.databricks.com
      http_path: /sql/1.0/endpoints/***
      catalog: hive_metastore
      schema: tube_silver_prod
      threads: 4
      token: ***
      type: databricks
  target: dev
Thanks Anton, I ended up resolving this. I created a temporary workspace to test it before applying it to the main workspace. The catalog attribute can be applied almost anywhere you can specify the schema attribute, not just in profiles.yml. I now have a dbt project which targets multiple catalogs; these are set in dbt_project.yml at the appropriate model level.

Is it possible to have a GitLab CI/CD pipeline for Azure Snowflake DB?

Is there anyone out here who has implemented CI/CD with GitLab for Azure Snowflake? Is this even possible?
Our DB development is growing fast and it's turning out to be a challenging experience to develop, maintain and deploy.
We have the Visual Studio Code IDE, which is now bound to a Git repository whose branches I would like to map to Prod, Dev and Test depending on commits to the respective branches. Also, is it even possible to have something like a Config.sql (similar to SQLCMD in SQL Server) or an application.properties (as in a Java Spring Boot project), where one can maintain three different config files with environment-specific variables whose values are substituted dynamically during the BUILD step of the CI/CD pipeline? I want (at least for now) to keep database and schema names as config variables, which will differ depending on where one deploys.
Yes, you can install and configure the snowsql tool as part of your CI/CD pipeline; snowsql is to Snowflake roughly what sqlcmd is to SQL Server. For the environment-specific config values, a minimal substitution sketch follows the links below.
Installing snowsql
Configuring snowsql
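To illustrate the Config.sql idea from the question, here is a minimal sketch of rendering environment-specific SQL before the pipeline hands it to snowsql; the file names, database/schema names and placeholder keys are hypothetical:
# render_sql.py - fill environment-specific placeholders in a SQL template
import sys
from string import Template

ENV_CONFIG = {
    'dev':  {'database': 'MY_DB_DEV',  'schema': 'STAGING'},
    'test': {'database': 'MY_DB_TEST', 'schema': 'STAGING'},
    'prod': {'database': 'MY_DB_PROD', 'schema': 'STAGING'},
}

def render(template_path, environment):
    # The template uses $database / $schema placeholders
    with open(template_path) as f:
        return Template(f.read()).substitute(ENV_CONFIG[environment])

if __name__ == '__main__':
    # e.g. python render_sql.py deploy.sql.tpl dev > deploy.sql
    print(render(sys.argv[1], sys.argv[2]))
The GitLab job can derive the environment from the branch name and then execute the rendered deploy.sql with snowsql.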

Azure Data Factory V2 multiple environments like in SSIS

I'm coming from a long SSIS background; we're looking to use Azure Data Factory v2, but I'm struggling to find any clear way of working with multiple environments. In SSIS we would have project parameters tied to the Visual Studio project configuration (e.g. development/test/production), and if there were two parameters, SourceServerName and DestinationServerName, these would point to different servers depending on whether we were in development or test.
From my initial playing around I can't see any way to do this in Data Factory. I've searched Google, of course, but any information I've found seems to be around CI/CD, then talks about Git 'branches', and is difficult to follow.
I'm basically looking for a very simple explanation and example of how this would be achieved in Azure data factory v2 (if it is even possible).
It works differently: you create an instance of Data Factory per environment, and the environment is effectively embedded in each instance.
So here's one simple approach:
Create three data factories: dev, test, prod
Create your linked services in the dev environment pointing at dev sources and targets
Create the same named linked services in test, but of course these point at your test systems
Now when you "migrate" your pipelines from dev to test, they use the same logical name (just like a connection manager)
So you don't designate an environment at execution time or map variables or anything... everything in test just runs against test, because that's the way the linked services have been defined.
That's the first step.
The next step is to connect only the dev ADF instance to Git. If you're a newcomer to Git it can be daunting but it's just a version control system. You save your code to it and it remembers every change you made.
Once your pipeline code is in git, the theory is that you migrate code out of git into higher environments in an automated fashion.
If you go through the links provided in the other answer, you'll see how you set it up.
I do have an issue with this approach though: you have to look up all of your environment values in a key store, which to me is silly, because why do we need to designate the test server's hostname every time we deploy to test?
One last thing: if you have a pipeline that doesn't use a linked service (say, a REST pipeline), I haven't found a way to make it environment-aware. I ended up building logic around the current data factory's name to dynamically change endpoints.
This is a bit of a brain dump, but feel free to ask questions.
Although it's not recommended - yes, you can do it.
Take a look at Linked Service - in this case, I have a connection to Azure SQL Database:
You can use dynamic content for both the server name and the database name.
Just add a parameter to your pipeline, pass it to the Linked Service and use in the required field.
Let me know whether I explained it clearly enough.
Yes, it's possible, although not as simple as it was in Visual Studio for SSIS.
1) First of all: there is no desktop application for developing ADF, only the browser.
Therefore developers should make their changes in the DEV environment, and for many reasons the best way to do this is to work with a Git repository connected.
2) Then, you need "only":
a) publish the changes (this creates/updates the adf_publish branch in Git)
b) with Azure DevOps, deploy the code from adf_publish, replacing the required parameters for the target environment.
I know that at the beginning it sounds horrible, but the sooner you set up an environment like this the more time you save while developing pipelines.
How to do these things step by step?
I describe all the steps in the following posts:
- Setting up Code Repository for Azure Data Factory v2
- Deployment of Azure Data Factory with Azure DevOps
I hope this helps.

How to delete an experiment from an Azure Machine Learning workspace

I create experiments in my workspace using the Python SDK (azureml-sdk). I now have a lot of 'test' experiments littering our workspace. How can I delete individual experiments, either through the API or on the portal? I know I can delete the whole workspace, but there are some good experiments we don't want to delete.
https://learn.microsoft.com/en-us/azure/machine-learning/service/how-to-export-delete-data#delete-visual-interface-assets suggests it is possible but my workspace view does not look anything like what is shown there
Experiment deletion is a common request and we in the Azure ML team are working on it. Unfortunately it's not supported quite yet.
Starting from the 2021-08-24 Azure ML workspace release you can delete an experiment, but only by clicking in the UI (select the experiment in the Experiments view -> 'Delete').
Watch out: deleting an experiment deletes all the underlying runs, and deleting a run deletes the child runs, run metrics, metadata, outputs, logs and working directories!
Only for experiments without any underlying runs can you use the Python SDK (azureml-core==1.34.0) and the Experiment class's static delete method, for example:
from azureml.core import Workspace, Experiment

# Load the workspace and look up the experiment's id
aml_workspace = Workspace.from_config()
experiment_id = Experiment(aml_workspace, '<experiment_name>').id
# Delete the (empty) experiment by id
Experiment.delete(aml_workspace, experiment_id)
If an experiment has runs you will get an error:
CloudError: Azure Error: UserError
Message: Only empty Experiments can be deleted. This experiment contains run(s)
I hope Azure ML team gets this functionality to Python SDK soon!
Also, on a sad note: it would be great if the deletion were optimized; for now it is an extremely slow, synchronous call (an async variant would be welcome as well)...
You can remove your experiment from the workspace list by archiving it:
# Declare your experiment
from azureml.core import Experiment
experiment = Experiment(workspace=ws, name="<your_experiment>")
# Archive the experiment
experiment.archive()
# Now check the list of experiments in your AML workspace and see that it no longer appears
This issue is still open at the moment. What I have figured out to avoid too many experiments in the workspace is to run locally with the Python SDK and then upload the output files to the run's outputs folder when the run completes (a fuller sketch follows the snippet below).
You can do the upload like this:
run.upload_file(name='outputs/sample.csv', path_or_stream='./sample.csv')
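For context, here is a minimal sketch of that local-run pattern with azureml-core; the experiment name, metric and file names are placeholders:
from azureml.core import Workspace, Experiment

ws = Workspace.from_config()
exp = Experiment(workspace=ws, name='local_dev')  # placeholder name

# Start an interactive (local) run instead of submitting a remote one
run = exp.start_logging()
run.log('accuracy', 0.90)  # log whatever metrics you want to keep

# Attach local result files to the run's outputs folder, then finish the run
run.upload_file(name='outputs/sample.csv', path_or_stream='./sample.csv')
run.complete()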
Follow these two steps:
1. Delete the experiment's child jobs in Azure ML Studio.
2. Delete the (now empty) experiment with the Python API:
from azureml.core import Workspace, Experiment

# Choose the workspace and experiment
ws = Workspace.from_config()
exp_name = 'digits_recognition'
exp = Experiment(ws, exp_name)
# The child jobs must already have been deleted in Azure ML Studio (step 1)
Experiment.delete(ws, exp.id)
Note: for more fine-grained control over deletions, use the Azure CLI.

Ideal terraform workspace project structure

I'd like to setup Terraform to manage dev/stage/prod environments. The infrastructure is the same in all environments, but there are differences in the variables in every environment.
What does an ideal Terraform project structure look like now that workspaces have been introduced in Terraform 0.10? How do I reference the workspace when naming/tagging infrastructure?
I wouldn't recommend using workspaces (previously 'environments') for static environments because they add a fair bit of complexity and are harder to keep track of.
You could get away with using a single folder structure for all environments, use workspaces to separate the environments and then use conditional values based on the workspace to set the differences. In practice (and especially with more than 2 environments leading to nested ternary statements) you'll probably find this difficult to manage.
Instead, I'd still advocate separate folders for every static environment, using symlinks to keep all your .tf files the same across environments and a terraform.tfvars file to provide the differences for each environment.
I would recommend workspaces for dynamic environments such as short-lived review/lab environments, as this allows for a lot of flexibility. I'm currently using them to create review environments in GitLab CI so every branch can have an optionally deployed review environment that can be used for manual integration or exploratory testing.
In the old world you might have passed in the var 'environment' when running terraform, which you would interpolate in your .tf files as "${var.environment}".
When using workspaces, there is no need to pass in an environment variable, you just make sure you are in the correct workspace and then interpolate inside your .tf files with "${terraform.workspace}"
As for how you'd manage all of the variables, I'd recommend using a var map, like so:
variable "vpc_cidr" {
type = "map"
default = {
dev = "172.0.0.0/24"
preprod = "172.0.0.0/24"
prod = "172.0.0.0/24"
}
}
This would then be referenced in an aws_vpc resource using a lookup:
"${lookup(var.vpc_cidr, terraform.workspace)}"
The process of creating and selecting workspaces is pretty easy:
terraform workspace
Usage: terraform workspace

  Create, change and delete Terraform workspaces.

Subcommands:
    show      Show the current workspace name.
    list      List workspaces.
    select    Select a workspace.
    new       Create a new workspace.
    delete    Delete an existing workspace.
So to create a new workspace for pre-production you'd do the following:
terraform workspace new preprod
If you then ran a plan, you'd see that there are no resources. What this does in the backend is create a new folder to manage the state for 'preprod'.
