DevOps for Azure Databricks Jobs

I am trying to implement DevOps on Azure Databricks.
I have completed the DevOps implementation for Databricks notebooks and DBFS files.
I have many Databricks jobs running on my cluster on a schedule.
Some of these jobs point to notebook files, and a few point to a JAR file in the DBFS location.
Is there any way to implement a DevOps process for Azure Databricks jobs, so that any change to a job in DEV will trigger the build pipeline and deploy the same job to the PROD Databricks instance?
First of all, I want to know whether it is possible to implement DevOps for Azure Databricks jobs at all.
Any Leads Appreciated!

To do this effectively, I would recommend using the Databricks Terraform provider: the job definition is stored in Git (or a similar system), and it is then easy to integrate with CI/CD systems such as Azure DevOps, GitHub Actions, etc.
The differences between environments can be encoded as variables, with a separate variable file per environment, so you can reuse the main code between environments, like this:
provider "databricks" {
  host  = var.db_host
  token = var.db_token
}

data "databricks_spark_version" "latest" {}

data "databricks_node_type" "smallest" {
  local_disk = true
}

resource "databricks_job" "this" {
  name = "Job"

  new_cluster {
    num_workers   = 1
    spark_version = data.databricks_spark_version.latest.id
    node_type_id  = data.databricks_node_type.smallest.id
  }

  notebook_task {
    notebook_path = "path_to_notebook"
  }

  email_notifications {}
}
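As a sketch of the per-environment wiring (the variable names follow the provider block above; the hostnames are placeholders), the variables are declared once and given different values in per-environment .tfvars files:

```hcl
# variables.tf
variable "db_host" {
  type = string
}

variable "db_token" {
  type      = string
  sensitive = true
}

# dev.tfvars  would then set, e.g.:
#   db_host = "https://adb-111.azuredatabricks.net"
# prod.tfvars:
#   db_host = "https://adb-222.azuredatabricks.net"
```

The pipeline then selects the environment with e.g. `terraform apply -var-file=prod.tfvars`, while the token is better passed via the `TF_VAR_db_token` environment variable than stored in a file.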
P.S. In theory, you could implement a periodic task that pulls the job definitions from your original environment, checks whether they have changed, and applies the changes to the other environment. You could even track changes to job definitions via diagnostic logs and use those as a trigger.
But all of this is just a hack - it's better to use Terraform.
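A minimal sketch of that periodic-sync hack, assuming the job settings have already been fetched from each workspace's Jobs API as name -> settings dicts (the fetch itself is omitted):

```python
# Sketch: detect which job definitions changed between two environments.
# Assumes `source` and `target` map job name -> job settings dict,
# e.g. as assembled from GET /api/2.1/jobs/list in each workspace.

def changed_jobs(source: dict, target: dict) -> dict:
    """Return jobs that are new or whose settings differ in the target."""
    changes = {}
    for name, settings in source.items():
        if name not in target:
            changes[name] = "create"
        elif target[name] != settings:
            changes[name] = "update"
    return changes

dev = {"etl": {"max_retries": 3}, "report": {"max_retries": 1}}
prod = {"etl": {"max_retries": 1}}
print(changed_jobs(dev, prod))  # {'etl': 'update', 'report': 'create'}
```

The jobs flagged here would then be pushed to the other workspace with the Jobs create/reset endpoints.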

Related

Cloud Run deployment pattern when new images are pushed: if services are created via Terraform, is this avoidable?

I have a set of cloud run services created/maintained via terraform cloud.
When I create a new version, a github actions workflow pushes a new image to gcr.io.
Now in a normal scenario, I'd call:
gcloud run deploy auth-service --image gcr.io/riu-production/auth-service:latest
And a new version would be up. If I do this while the resource is managed by Terraform, then on the next run terraform apply will fail, saying it can't create that Cloud Run service because a service with that name already exists. So the state drifts apart and Terraform no longer recognizes the service.
A simple solution is to connect the pipeline to terraform cloud and run terraform apply -auto-approve for deployment purposes. That should work.
The problem with that is I really, really don't want to run Terraform commands in a pipeline, for now.
And the biggest one is I really would like to keep terraform out of the deployment process altogether.
Is there any way to force cloud run to take that new image for a service without messing up the terraform infrastructure?
Cloud run configs:
resource "google_cloud_run_service" "auth-service" {
  name     = "auth-service"
  location = var.gcp_region
  project  = var.gcp_project

  template {
    spec {
      service_account_name = module.cloudrun-sa.email

      containers {
        image = "gcr.io/${var.gcp_project}/auth-service:latest"
      }
    }
  }

  traffic {
    percent         = 100
    latest_revision = true
  }
}
In theory, yes, it should be possible ...
But I would recommend against it: you should run terraform apply on every deployment to guarantee the infrastructure is as expected.
Here are some things you can try:
Keep track of when it changes and use the import on that resource:
https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/cloud_run_service#import
Look into the lifecycle ignore_changes meta-argument, which lets you ignore the attribute that triggers the change:
https://www.terraform.io/language/meta-arguments/lifecycle#ignore_changes
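For the asker's setup, the second option could look like this (a sketch: only the lifecycle block is new, the rest of the resource stays as in the question):

```hcl
resource "google_cloud_run_service" "auth-service" {
  # ... name, location, project, template and traffic as in the question ...

  lifecycle {
    # Ignore the image so `gcloud run deploy` can roll out new
    # versions without Terraform reverting the service to :latest.
    ignore_changes = [
      template[0].spec[0].containers[0].image,
    ]
  }
}
```

The trade-off is that Terraform will no longer manage the image at all, so the pipeline becomes the only source of truth for which revision is running.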

Creating multiple AWS Glue jobs using Terraform

I am new to Terraform, so I am looking for some advice.
I need to deploy 30+ AWS Glue jobs (Python) using Terraform which will be executed by a Jenkins pipeline.
Looking at the Terraform documentation, creating a single AWS Glue job is pretty straightforward.
resource "aws_glue_job" "example" {
  name     = "example"
  role_arn = aws_iam_role.example.arn

  command {
    script_location = "s3://${aws_s3_bucket.example.bucket}/example.py"
  }
}
How can I take this example and deploy 30+ jobs using a single Terraform script? Ideally, I would maintain a "manifest" file that includes entries for job names, script locations, etc. and somehow loop through it. But I am open to suggestions.
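One way to sketch that manifest approach (the job names and paths below are made up) is a map variable plus for_each, so adding a job is just adding an entry to a .tfvars file:

```hcl
variable "glue_jobs" {
  type = map(object({
    script_location = string
  }))
  default = {
    "ingest-orders"    = { script_location = "s3://my-bucket/ingest_orders.py" }
    "transform-orders" = { script_location = "s3://my-bucket/transform_orders.py" }
  }
}

resource "aws_glue_job" "jobs" {
  for_each = var.glue_jobs

  name     = each.key
  role_arn = aws_iam_role.example.arn

  command {
    script_location = each.value.script_location
  }
}
```

With for_each, each job gets its own address in state (aws_glue_job.jobs["ingest-orders"]), so removing one entry only destroys that job rather than renumbering the rest as count would.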

Best way to store Terraform variable values without having them in source control

We have a code repo with our IaC in Terraform. This is in GitHub, and we're going to pull the code, build it, etc. However, we don't want the values of our variables in GitHub itself. So this may be a dumb question, but where do we store the values we need for our variables? If my Terraform requires an Azure subscription ID, where would I store the subscription ID? The vars files won't be in source control. The goal is that we'll be pulling the code into an Azure DevOps pipeline, so the pipeline will have to know where to go to get the input variable values. I hope that makes sense?
You can store your secrets in Azure Key Vault and retrieve them in Terraform using azurerm_key_vault_secret.
data "azurerm_key_vault_secret" "example" {
  name         = "secret-sauce"
  key_vault_id = data.azurerm_key_vault.existing.id
}

output "secret_value" {
  value     = data.azurerm_key_vault_secret.example.value
  sensitive = true
}
https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/data-sources/key_vault_secret
There has to be a source of truth eventually.
You can store your values in the pipeline definitions as variables themselves and pass them into the Terraform configuration.
Usually it's a combination of tfvars files (dependent on the target environment) and some variables from the pipeline. If you do have vars in your pipelines, though, the pipelines should be in code.
If the variables are sensitive then you need to connect to a secret management tool to get those variables.
If you have many environments, say 20, and the infra is all the same except for a single ID, you could have the same pipeline definition (normally JSON or YAML) and reference it for the 20 pipelines you build; each of those 20 would have that unique value baked in for use at execution. That var is passed through to Terraform as the missing piece.
There are other key-value property tracking systems out there but Git definitely works well for this purpose.
You can use Azure DevOps secure files (Pipelines -> Library) for storing your credentials for each environment. You can create a tfvars file for each environment with all your credentials, upload it as a secure file in Azure DevOps, and then download it in the pipeline with a DownloadSecureFile@1 task.
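A sketch of that step in YAML (the secure-file name and the follow-up step are assumptions):

```yaml
steps:
  - task: DownloadSecureFile@1
    name: tfvars
    inputs:
      secureFile: 'prod.tfvars'

  # The downloaded path is exposed as $(tfvars.secureFilePath)
  - script: terraform apply -var-file="$(tfvars.secureFilePath)" -auto-approve
    displayName: Terraform apply with environment credentials
```

Secure files are encrypted at rest and never end up in the repo; the agent deletes the downloaded copy when the job finishes.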

How to rename a Databricks job cluster during runtime

I have created an ADF pipeline with a Notebook activity. This Notebook activity automatically creates Databricks job clusters with autogenerated job cluster names.
1. Rename the job cluster during runtime from ADF
I'm trying to rename this job cluster with the process/other name during runtime from ADF / the ADF linked service.
Instead of job-59, I want it to be replaced with <process_name>_
2. Rename the ClusterName tag
I want to replace the default generated ClusterName tag with the required process name.
Settings for the job can be updated using the Reset or Update endpoints.
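For example, renaming a job boils down to a partial update against the Jobs API (a sketch: the job ID, new name, workspace URL, and token are placeholders):

```python
import json

def build_rename_payload(job_id: int, new_name: str) -> dict:
    """Body for POST /api/2.1/jobs/update: only the fields listed in
    new_settings are changed; everything else on the job is left as-is."""
    return {"job_id": job_id, "new_settings": {"name": new_name}}

payload = build_rename_payload(59, "my_process_job")
print(json.dumps(payload))
# {"job_id": 59, "new_settings": {"name": "my_process_job"}}

# Sent with e.g.:
# requests.post(f"https://{workspace_url}/api/2.1/jobs/update",
#               headers={"Authorization": f"Bearer {token}"},
#               json=payload)
```

The Reset endpoint (/api/2.1/jobs/reset), by contrast, replaces the entire settings object, so use it only when you want to overwrite the whole job definition.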
Cluster tags allow you to easily monitor the cost of cloud resources used by various groups in your organization. You can specify tags as key-value pairs when you create a cluster, and Azure Databricks applies these tags to cloud resources like VMs and disk volumes, as well as DBU usage reports.
For detailed information about how pool and cluster tag types work together, see Monitor usage using cluster, pool, and workspace tags.
For convenience, Azure Databricks applies four default tags to each cluster: Vendor, Creator, ClusterName, and ClusterId.
These tags propagate to detailed cost analysis reports that you can access in the Azure portal.
Check out an example of how billing works.

Exclude Azure Data Factory connections and integration runtime from Azure DevOps sync

So we have configured ADF to use GIT under DevOps.
The problem is our connection details are getting synced between the dev\qa\master branches, which causes issues as each environment has its own SQL Servers.
Is there any way to keep connections and IR out of sync operation between branches?
Look at this similar post, which also asks how to use parameters for SQL connection information in ADF.
Your solution should also leverage Managed Identities for creating the access policies in the Key Vault; this can be done via ARM.
One additional comment: the Linked Services are where the parameter substitution of these values would occur.
Connections should rather be parameterized than removed from a deployment pipeline.
Parameterization can be done by using "pipeline" and "variable group" variables.
As an example, a pipeline variable adf-keyvault can be used to point to the right Key Vault instance that belongs to a certain environment:
adf-keyvault = "adf-kv-yourProjectName-$(Environment)"
The variable $Environment is declared at the variable-group level, so each environment has its own value mapped, for instance:
$Environment = 'dev' #development
$Environment = 'stg' #staging
$Environment = 'prd' #production
Therefore the final value of adf-keyvault, depending on environment, resolves into:
adf-keyvault = "adf-kv-yourProjectName-dev"
adf-keyvault = "adf-kv-yourProjectName-stg"
adf-keyvault = "adf-kv-yourProjectName-prd"
And each Key Vault stores the connection string to a database server in a secret with the same name across environments. For instance:
adf-sqldb-connectionstring = Server=123.123.123.123;Database=adf-sqldb-dev;User Id=myUsername;Password=myPassword;
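With that in place, the SQL linked service can reference the secret by name instead of embedding the connection string; a sketch of the linked service JSON (the linked service names are assumptions):

```json
{
  "name": "AzureSqlDb",
  "properties": {
    "type": "AzureSqlDatabase",
    "typeProperties": {
      "connectionString": {
        "type": "AzureKeyVaultSecret",
        "store": {
          "referenceName": "KeyVaultLinkedService",
          "type": "LinkedServiceReference"
        },
        "secretName": "adf-sqldb-connectionstring"
      }
    }
  }
}
```

Because the secret name is identical in every environment's Key Vault, the same linked service definition can be promoted unchanged between branches.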
Because the initial setup of CI/CD pipelines in Azure Data Factory can seem complex at first glance, I recently blogged a step-by-step guide on this topic: Azure Data Factory & DevOps – Setting-up Continuous Delivery Pipeline
