I'm trying to use workspace environment variables to pass access tokens into my custom cluster init scripts.
It appears that there are only a few supported environment variables that we can access in our custom cluster init scripts as described at https://docs.databricks.com/clusters/init-scripts.html#environment-variables
I've attempted to write to the base cluster configuration using
Microsoft.Azure.Databricks.Client.SparkEnvironmentVariables.Add("WORKSPACE_ID", workspaceId)
My init script is still failing to pick up this variable in the following line:
[[ -z "${WORKSPACE_ID}" ]] && LOG_ANALYTICS_WORKSPACE_ID='default' || LOG_ANALYTICS_WORKSPACE_ID="${WORKSPACE_ID}"
With the above lines of code, my init script causes the cluster to fail with the following error:
Spark Error: Spark encountered an error on startup. This issue can be caused by
invalid Spark configurations or malfunctioning init scripts. Please refer to the Spark
driver logs to troubleshoot this issue, and contact Databricks if the problem persists.
Internal error message: Spark error: Driver down
The logs don't show any part of my bash script failing, so I'm assuming it's simply not picking up the variable from the environment.
Has anyone else dealt with this problem? I realize that I could write this information to DBFS and then read it into the init script, but I'd like to avoid that since I'll be passing in access tokens. What other approaches can I try?
Thanks for any help!
This article shows how to send application logs and metrics from Azure Databricks to a Log Analytics workspace. It uses the Azure Databricks Monitoring Library, which is available on GitHub.
Prerequisites: Configure your Azure Databricks cluster to use the monitoring library, as described in the GitHub readme.
Steps to build the Azure monitoring library and configure an Azure Databricks cluster:
Step 1: Build the Azure Databricks monitoring library
Step 2: Create and configure the Azure Databricks cluster
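As a rough sketch of what these two steps amount to (the build command, file locations, and variable names follow the library's README and may have changed - treat them as placeholders):
# Step 1 (sketch): build the library per the repo's README, then copy the
# resulting jars and the spark-monitoring.sh init script to DBFS, e.g.:
databricks fs mkdirs dbfs:/databricks/spark-monitoring
databricks fs cp --overwrite spark-monitoring.sh dbfs:/databricks/spark-monitoring/
databricks fs cp --recursive --overwrite <built-jars-directory>/ dbfs:/databricks/spark-monitoring/

# Step 2 (sketch): configure the cluster to run the init script and pass the
# Log Analytics workspace ID/key as cluster environment variables:
#   Init script path:      dbfs:/databricks/spark-monitoring/spark-monitoring.sh
#   Environment variables: LOG_ANALYTICS_WORKSPACE_ID=<workspace id>
#                          LOG_ANALYTICS_WORKSPACE_KEY=<workspace key>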
For more details, refer "Monitoring Azure Databricks".
Hope this helps.
I use Azure DevOps pipelines. With Bicep you can run a script at the database level, which is clearly described in the documentation. But I want to run a script at the cluster level to update the workload_group policy and increase the number of allowed concurrent queries. When I run the query as part of the Bicep deployment (via the database script property), it fails with the following error:
Reason: Not a database-scope command
How can I run this command (which indeed has to be run at the cluster level) as part of the Bicep deployment? I use the following command, which does work when I run it in the query window in the Azure portal.
.create-or-alter workload_group ['default'] ```
<<workgroupConfig>>
```
I also know there are Azure DevOps tasks for running scripts against the database, but I would rather not use those, since the Data Explorer cluster is on a private network and not publicly accessible.
I am trying to save the Apache Spark logs (the contents of the Spark UI), not necessarily the stderr, stdout, and log4j files (although they might be useful too), to a file so that I can send it to someone else to analyze.
I am following the instructions described in the Apache Spark documentation here:
https://spark.apache.org/docs/latest/monitoring.html#viewing-after-the-fact
The problem is that I am running the code on Azure Databricks. Databricks saves the logs elsewhere; you can display them from the web UI but cannot export them.
When I ran the Spark job with spark.eventLog.dir set to a location in DBFS, the file was created but it was empty.
Is there a way to export the full Databricks job log so that anyone can open it without giving them the access to the workspace?
The simplest way of doing it is the following:
You create a separate storage account with a container in it (or a separate container in an existing storage account) and give developers access to it
You mount that container to the Databricks workspace
You configure clusters/jobs to write logs into the mount location (you can enforce it for new objects using cluster policies); see the sketch after this list. This will create per-cluster sub-directories containing driver & executor logs plus the output of init script execution
(Optional) You can set up a retention policy on that container to automatically remove old logs.
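For illustration, pointing an existing cluster's log delivery at such a mount via the Clusters API would look roughly like this (a sketch - the cluster ID, node type, and mount path are placeholders, and clusters/edit needs the full cluster spec):
# Sketch: set cluster_log_conf to the mounted container (placeholders throughout)
curl -s -X POST "$DATABRICKS_HOST/api/2.0/clusters/edit" \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -d '{
        "cluster_id": "<cluster-id>",
        "cluster_name": "my-cluster",
        "spark_version": "11.3.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 2,
        "cluster_log_conf": { "dbfs": { "destination": "dbfs:/mnt/cluster-logs" } }
      }'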
I have an Azure Databricks workspace and I added a service principal to that workspace using the Databricks CLI. I have been trying to create a cluster as the service principal but can't figure it out. Can anyone help me?
I am able to create a cluster using my own account, but I want to create it as the service principal so that it, not me, is the owner of the cluster.
Also, is there a way to transfer the ownership of my cluster to the service principal?
First, answering the second question - no, you can't change the owner of the cluster.
To create a cluster that will have the service principal as its owner, you need to execute the creation operation under its identity. To do this, perform the following steps:
Prepare a JSON file with the cluster definition, as described in the documentation
Set the DATABRICKS_HOST environment variable to the address of your workspace:
export DATABRICKS_HOST=https://adb-....azuredatabricks.net
Generate an AAD token for the service principal as described in the documentation and assign its value to the DATABRICKS_TOKEN or DATABRICKS_AAD_TOKEN environment variable (see docs).
Create the Databricks cluster using the databricks-cli, providing the name of the JSON file with the cluster specification (docs):
databricks clusters create --json-file create-cluster.json
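Putting these steps together, an end-to-end sketch could look like this (the cluster spec, app ID, secret, tenant ID, and workspace URL are placeholders; 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d is the Azure Databricks resource ID used in the docs):
# 1. Minimal cluster definition (see the Clusters API docs for the full schema)
cat > create-cluster.json <<'EOF'
{
  "cluster_name": "sp-owned-cluster",
  "spark_version": "11.3.x-scala2.12",
  "node_type_id": "Standard_DS3_v2",
  "num_workers": 2
}
EOF

# 2. Workspace address
export DATABRICKS_HOST=https://adb-<workspace-id>.<n>.azuredatabricks.net

# 3. AAD token issued to the service principal
az login --service-principal -u <app-id> -p <client-secret> --tenant <tenant-id>
export DATABRICKS_TOKEN=$(az account get-access-token \
  --resource 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d \
  --query accessToken -o tsv)

# 4. Create the cluster under the service principal's identity
databricks clusters create --json-file create-cluster.json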
P.S. Another approach (really recommended) is to use the Databricks Terraform provider to script your Databricks infrastructure - it's used by a significant number of Databricks customers and is much easier to use than the command-line tools.
I have tried to access the secret {{secrets/secrectScope/Key}} in the advanced tab of a Databricks cluster and it works fine. But when I try to use the same reference in a Databricks init script, it does not work.
What are the steps to do that?
Another answer is correct regarding the syntax of the secrets reference (so-called "secret paths"), but it won't work for init scripts, although it will work for Spark code itself.
To pass the secret to the init script you need to put the secret path into the "Environment Variables" section of the cluster's Spark configuration tab, like this:
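For example, using the scope and key names from the question (adjust to your own), the entry would look roughly like:
SECRET_VAR={{secrets/secrectScope/Key}}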
And after that you can use the variable by name inside the init script:
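# SECRET_VAR is resolved to the secret's value before the init script runs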
if [ -n "$SECRET_VAR" ]; then
do_something_with_it
fi
Here are the steps to access secrets in a Databricks init script:
Go to the cluster.
Click Edit next to the Cluster information.
On the Configure Cluster page, click Advanced Options.
On the Spark tab, enter the following Spark Config:
Sample Spark config:
fs.azure.account.auth.type.chepragen2.dfs.core.windows.net OAuth
fs.azure.account.oauth.provider.type.chepragen2.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
fs.azure.account.oauth2.client.id.chepragen2.dfs.core.windows.net {{secrets/KeyVaultName/ClientID}}
fs.azure.account.oauth2.client.secret.chepragen2.dfs.core.windows.net {{secrets/KeyVaultName/ClientSecret}}
fs.azure.account.oauth2.client.endpoint.chepragen2.dfs.core.windows.net https://login.microsoftonline.com/<Directory_ID>/oauth2/token
For more details, refer to Azure Databricks - configure the cluster to read secrets from the secret scope.
I'm trying to use the azure command-line interface.
I imported the manifest file and am able to run azure hdinsight -h and azure account list (which gives me the correct credentials).
However, I'm unable to list my HDInsight clusters with
azure hdinsight cluster list
This returns the following error:
- Getting HDInsight serverserror: tunneling socket could not be established, cause=1500:error:140770FC:SSL routines:SSL23_GET_SERVER_HELLO:unknown protocol:openssl\ssl\s23_clnt.c:766:
info: Error information has been recorded to azure.err
error: hdinsight cluster list command failed
I get a similar error message when doing azure hdinsight account storage create storagename
Did I miss a step in the installation, or is something else going wrong? I'm working behind a proxy and have http_proxy and https_proxy set correctly.
In order to proceed with the project, you could also launch PowerShell from the portal itself and execute the cmdlets from there (it is called Cloud Shell in Azure; click the highlighted Cloud Shell icon in the portal toolbar). I just launched the PowerShell window from the portal, executed "azure hdinsight cluster list", and it returned the list of my clusters.
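If you are on the current az CLI rather than the classic azure CLI, the equivalent listing would look roughly like this (a sketch; the resource group name is a placeholder):
# List HDInsight clusters with the modern Azure CLI (also works in Cloud Shell)
az hdinsight list --output table
# Or scoped to a single resource group
az hdinsight list --resource-group <my-resource-group> --output table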
More details about Azure PowerShell:
https://azure.microsoft.com/en-us/blog/powershell-comes-to-azure-cloud-shell/
and
https://learn.microsoft.com/en-us/azure/cloud-shell/quickstart-powershell