How to access secrets in a Databricks init script

I have tried to access the secret {{secrets/secrectScope/Key}} in the Advanced tab of a Databricks cluster and it works fine. But when I try to use the same in a Databricks init script, it does not work.
What are the steps to do that?

Another answer is correct regarding the syntax of the secrets reference (so-called "secret paths"), but it won't work for init scripts, although it will work for Spark code itself.
To pass the secret to the init script you need to put the secret path into the "Environment Variables" section of the Spark configuration tab, like this:
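For example, using the scope and key from the question, the Environment Variables field could contain something like the following (the variable name SECRET_VAR is just an illustration):

SECRET_VAR={{secrets/secrectScope/Key}}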
And after that you can use the variable by name inside the init script:
if [ -n "$SECRET_VAR" ]; then
  do_something_with_it
fi

Here are the steps to access secrets in a Databricks init script:
Go to the cluster.
Click Edit next to the Cluster information.
On the Configure Cluster page, click Advanced Options.
On the Spark tab, enter the following Spark Config:
Sample Spark config:
fs.azure.account.auth.type.chepragen2.dfs.core.windows.net OAuth
fs.azure.account.oauth.provider.type.chepragen2.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
fs.azure.account.oauth2.client.id.chepragen2.dfs.core.windows.net {{secrets/KeyVaultName/ClientID}}
fs.azure.account.oauth2.client.secret.chepragen2.dfs.core.windows.net {{secrets/KeyVaultName/ClientSecret}}
fs.azure.account.oauth2.client.endpoint.chepragen2.dfs.core.windows.net https://login.microsoftonline.com/<Directory_ID>/oauth2/token
For more details, refer to Azure Databricks - configure the cluster to read secrets from the secret scope.

Related

Databricks SQL Editor "Failure to initialize configuration"

When I'm trying to select something from one specific table in SQL Editor, I'm getting an error "Failure to initialize configuration".
The query is as simple as select * from table_name. I also tried with limits and/or selecting specific columns, but got the same error.
If I switch to "Data Science & Engineering" and execute the same query using a regular cluster in a notebook everything works.
Edit the Spark Config by entering the connection information for your Azure Storage account.
This will allow your cluster to access the files. Enter the following:
spark.hadoop.fs.azure.account.key.<STORAGE_ACCOUNT_NAME>.blob.core.windows.net <ACCESS_KEY>
where <STORAGE_ACCOUNT_NAME> is your Azure Storage account name and <ACCESS_KEY> is your storage access key.
If using Azure Key Vault, you can create a Key Vault-backed secret scope (https://learn.microsoft.com/en-us/azure/databricks/security/secrets/secret-scopes) and access the values via the following syntax in your Spark config: {{secrets/<scope-name>/<key-name>}}
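For instance, instead of putting the raw access key into the Spark config shown above, the value could be a secret reference (the scope and key names here are placeholders):

spark.hadoop.fs.azure.account.key.<STORAGE_ACCOUNT_NAME>.blob.core.windows.net {{secrets/<scope-name>/<access-key-name>}}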

"No Isolation Shared" Databricks job cluster through CLI

I turned on Unity Catalog for our workspace. Now a job cluster has an access mode setting (docs). I can manually change this setting in the UI.
But how do I control this setting when creating the job through databricks jobs create --json-file X.json?
You need to specify data_security_mode with the value "NONE" in the cluster definition (for some reason it's missing from the API docs, but you can find details in the Terraform provider docs). But it should really be the default value, so you may not need to specify it explicitly.
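As a rough sketch, the cluster definition inside X.json could look like this (the Spark version and node type are placeholder values):

"new_cluster": {
  "spark_version": "11.3.x-scala2.12",
  "node_type_id": "Standard_DS3_v2",
  "num_workers": 1,
  "data_security_mode": "NONE"
}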

How to create Azure databricks cluster using Service Principal

I have an Azure Databricks workspace and I added a service principal to that workspace using the Databricks CLI. I have been trying to create a cluster using the service principal and am not able to figure it out. Can anyone help me?
I am able to create a cluster using my account, but I want to create it using the Service Principal and want it to be the owner of the cluster, not me.
Also, is there a way I can transfer the ownership of my cluster to the Service Principal?
First, answering the second question - no, you can't change the owner of the cluster.
To create a cluster that will have the Service Principal as owner you need to execute the creation operation under its identity. To do this, perform the following steps:
Prepare a JSON file with the cluster definition, as described in the documentation.
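A minimal create-cluster.json sketch (the cluster name, Spark version, and node type below are placeholders):

{
  "cluster_name": "sp-owned-cluster",
  "spark_version": "11.3.x-scala2.12",
  "node_type_id": "Standard_DS3_v2",
  "num_workers": 1,
  "autotermination_minutes": 60
}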
Set the DATABRICKS_HOST environment variable to the address of your workspace:
export DATABRICKS_HOST=https://adb-....azuredatabricks.net
Generate an AAD token for the Service Principal as described in the documentation and assign its value to the DATABRICKS_TOKEN or DATABRICKS_AAD_TOKEN environment variable (see docs).
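One way to do this, assuming the Azure CLI is available and the client ID, client secret, and tenant ID placeholders are filled in for your Service Principal:

# Log in as the Service Principal (placeholders, not real values)
az login --service-principal -u <client-id> -p <client-secret> --tenant <tenant-id>
# 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d is the Azure Databricks resource ID used when requesting AAD tokens
export DATABRICKS_TOKEN=$(az account get-access-token --resource 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d --query accessToken --output tsv)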
Create the Databricks cluster using the databricks-cli, providing the name of the JSON file with the cluster specification (docs):
databricks clusters create --json-file create-cluster.json
P.S. Another (really recommended) approach is to use the Databricks Terraform provider to script your Databricks infrastructure - it's used by a significant number of Databricks customers, and is much easier to use than the command-line tools.

How to Pass Variables into Azure Databricks Cluster Init Script

I'm trying to use workspace environment variables to pass access tokens into my custom cluster init scripts.
It appears that there are only a few supported environment variables that we can access in our custom cluster init scripts as described at https://docs.databricks.com/clusters/init-scripts.html#environment-variables
I've attempted to write to the base cluster configuration using
Microsoft.Azure.Databricks.Client.SparkEnvironmentVariables.Add("WORKSPACE_ID", workspaceId)
My init scripts are still failing to pick up this variable in the following line:
[[ -z "${WORKSPACE_ID}" ]] && LOG_ANALYTICS_WORKSPACE_ID='default' || LOG_ANALYTICS_WORKSPACE_ID="${WORKSPACE_ID}"
With the above lines of code, my init script causes the cluster to fail with the following error:
Spark Error: Spark encountered an error on startup. This issue can be caused by
invalid Spark configurations or malfunctioning init scripts. Please refer to the Spark
driver logs to troubleshoot this issue, and contact Databricks if the problem persists.
Internal error message: Spark error: Driver down
The logs don't say that any part of my bash script is failing, so I'm assuming that it's just failing to pick up the variable from the environment variables.
Has anyone else dealt with this problem? I realize that I could write this information to DBFS and then read it into the init script, but I'd like to avoid doing that since I'll be passing in access tokens. What other approaches can I try?
Thanks for any help!
This article shows how to send application logs and metrics from Azure Databricks to a Log Analytics workspace. It uses the Azure Databricks Monitoring Library, which is available on GitHub.
Prerequisites: Configure your Azure Databricks cluster to use the monitoring library, as described in the GitHub readme.
Steps to build the Azure monitoring library and configure an Azure Databricks cluster:
Step 1: Build the Azure Databricks monitoring library
Step 2: Create and configure the Azure Databricks cluster
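As a sketch of Step 2, the Log Analytics credentials can be passed to the cluster through the "Environment Variables" field, ideally as secret references rather than plain text. The scope and key names below are placeholders; the variable names follow the init script quoted in the question, so check the library's readme for the exact names it expects:

LOG_ANALYTICS_WORKSPACE_ID={{secrets/<scope-name>/<workspace-id>}}
LOG_ANALYTICS_WORKSPACE_KEY={{secrets/<scope-name>/<workspace-key>}}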
For more details, refer "Monitoring Azure Databricks".
Hope this helps.

Azure HDInsight script action

I am trying to copy a file from an accessible data lake to blob storage while spinning up the cluster.
I am using this command from Azure documentation
hadoop distcp adl://data_lake_store_account.azuredatalakestore.net:443/myfolder wasb://container_name@storage_account_name.blob.core.windows.net/example/data/gutenberg
Now, if I am trying to automate this instead of hardcoding, how do I use this in a script action? To be specific, how can I dynamically get the container name and storage_account_name associated with the cluster while spinning it up?
First, as the documentation describes:
A Script Action is simply a Bash script that you provide a URI to, and parameters for. The script runs on nodes in the HDInsight cluster.
So you just need to refer to the official tutorial Script action development with HDInsight to write your script action and learn how to run it. Or you can call the REST API Run Script Actions on a running cluster (Linux cluster only) to run it automatically.
For how to dynamically get the container name and storage account, a way that works from any language is to call the REST API Get configurations and extract the property you want from the core-site section of the JSON response, or to call the Get configuration REST API with core-site as the {configuration Type} parameter in the URL and extract the property you want from the JSON response.
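Alternatively, since a script action runs on the cluster nodes themselves, a simpler sketch (an assumption, not the REST approach described above) is to read the default storage URI from the node's Hadoop configuration and split it into the container and storage account names:

#!/bin/bash
# The default file system looks like wasb://container_name@storage_account_name.blob.core.windows.net
DEFAULT_FS=$(hdfs getconf -confKey fs.defaultFS)

# Strip the scheme, then split on '@' into container name and storage account host
NO_SCHEME=${DEFAULT_FS#*://}
CONTAINER_NAME=${NO_SCHEME%%@*}
STORAGE_ACCOUNT=$(echo "${NO_SCHEME#*@}" | cut -d. -f1)

hadoop distcp "adl://data_lake_store_account.azuredatalakestore.net:443/myfolder" \
  "wasb://${CONTAINER_NAME}@${STORAGE_ACCOUNT}.blob.core.windows.net/example/data/gutenberg"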
Hope it helps.
