Passing the Databricks ClusterID at runtime from an Azure Data Factory pipeline

I am looking to make the Azure Databricks linked service configurable and hence pass the Databricks workspace URL and the cluster ID at runtime. I will have multiple Spark clusters and, depending on the size required, I want to invoke the appropriate cluster.
I cannot find an option to get the Databricks cluster ID and pass it from the ADF pipeline.

You can use the Clusters API 2.0 REST API to get the cluster list:
https://adb-7012303279496007.7.azuredatabricks.net/api/2.0/clusters/list
I have reproduced the above and got the result below.
First, generate an access token in the Databricks workspace and use it as the authorization header in a Web activity to get the list of clusters.
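For reference, the equivalent call outside ADF is a plain GET on the Clusters API with the token passed as a bearer header (the workspace URL is the sample one above; <databricks-pat> is a placeholder token):

# List clusters in the workspace; <databricks-pat> is a placeholder personal access token
curl -s -X GET \
  -H "Authorization: Bearer <databricks-pat>" \
  "https://adb-7012303279496007.7.azuredatabricks.net/api/2.0/clusters/list"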
Output from web activity:
The above also contains the cluster size in MB. Store the output in an array variable.
To get the desired cluster ID based on cluster size, you can use a filter condition that matches your requirement.
Here, as a sample, I have used the cluster size in MB as the filter condition.
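The same filtering idea, sketched outside ADF with jq instead of the Filter activity (cluster_memory_mb is the size field in the list response; the 16384 MB threshold is only an example):

# Pick the first cluster with at least 16 GB of memory and print its cluster_id
curl -s -H "Authorization: Bearer <databricks-pat>" \
  "https://adb-7012303279496007.7.azuredatabricks.net/api/2.0/clusters/list" \
  | jq -r '[.clusters[] | select(.cluster_memory_mb >= 16384)][0].cluster_id'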
Notebook linked service:
Add a parameter for cluster_id.
Pass the desired cluster_id from the filtered array like below.
@activity('Filter1').output.Value[0].cluster_id
You can give the notebook path using dynamic content.
My Execution:

Related

PromQL queries in "Azure Monitor Managed Service for Prometheus" don't return the same results as Prometheus instance

On a k8s cluster (v1.23.12) running in Azure (AKS) I have deployed the helm chart azure-resourcemanager-exporter-1.0.4 from https://artifacthub.io/packages/container/azure-resourcemanager-exporter/azure-resourcemanager-exporter
Metrics are scraped from the k8s cluster using a local instance of Prometheus (see Prometheus version) and forwarded to "Azure Monitor Managed Service for Prometheus" using remote write.
When executing the following PromQL query on the Graph tab of the local Prometheus instance, I get the expected result:
sum by(dimensionValue) (azurerm_costmanagement_detail_actualcost{timeframe="MonthToDate", dimensionValue=~"microsoft.*"})
Fetches all series matching metric name and label filters and calculates sum over dimensions while preserving label "dimensionValue".
Result of the query in Prometheus
When I execute the same query in the Prometheus explorer blade of my Azure Monitor workspace instance, the query returns the sum of the metric as if the grouping by label "dimensionValue" were not there.
Query in Prometheus explorer
The label "dimensionValue" does exist in the labels of the metric in the Azure Monitor workspace.
Metric label exists
I also tried scraping the metrics from the exporter with the Azure agent in the k8s cluster (not using remote write), following the instructions in this article: https://learn.microsoft.com/en-us/azure/azure-monitor/essentials/prometheus-metrics-scrape-validate#create-prometheus-configuration-file
I get the same result when I execute the same query in Prometheus explorer.

Azure Data Factory API call throws payload limit error

We are performing a Copy activity with a REST API URL as the data source and ADLS Gen2 as the sink. The pipeline works in most cases and sporadically throws the error below. We have a nested pipeline that loops through multiple REST API request parameters and makes the calls inside a ForEach activity.
Error displayed in the ADF monitor:
Error Code - 2200
Failure Type - User Configuration issue
Details - The payload including configurations on activity/dataset/linked service is too large. Please check if you have settings with very large value and try to reduce its size.
Error message: The payload including configurations on activity/dataSet/linked service is too large. Please check if you have settings with very large value and try to reduce its size.
Cause: The payload for each activity run includes the activity configuration, the associated dataset(s), and linked service(s) configurations if any, and a small portion of system properties generated per activity type. The limit of such payload size is 896 KB as mentioned in the Azure limits documentation for Data Factory and Azure Synapse Analytics.
Recommendation: You hit this limit likely because you pass in one or more large parameter values from either upstream activity output or external, especially if you pass actual data across activities in control flow. Check if you can reduce the size of large parameter values or tune your pipeline logic to avoid passing such values across activities and handle it inside the activity instead.
Refer to https://learn.microsoft.com/en-us/azure/data-factory/data-factory-troubleshoot-guide#payload-is-too-large

how to rename Databricks job cluster name during runtime

I have created an ADF pipeline with a Notebook activity. This Notebook activity automatically creates Databricks job clusters with autogenerated job cluster names.
1. Rename Job Cluster during runtime from ADF
I'm trying to rename this job cluster with the process/other name during runtime from ADF / the ADF linked service.
Instead of job-59, I want it to be replaced with <process_name>_
2. Rename ClusterName Tag
I want to replace the default generated ClusterName tag with the required process name.
Settings for the job can be updated using the Reset or Update endpoints of the Jobs API.
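A minimal sketch of such a call, assuming Jobs API 2.1 and placeholder values for the workspace URL, job_id and the new name (Update keeps settings you do not send, while Reset replaces all of them):

# Rename an existing job; job_id 59 and the new name are placeholders
curl -s -X POST \
  -H "Authorization: Bearer <databricks-pat>" \
  -H "Content-Type: application/json" \
  -d '{"job_id": 59, "new_settings": {"name": "<process_name>_job"}}' \
  "https://<databricks-instance>/api/2.1/jobs/update"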
Cluster tags allow you to easily monitor the cost of cloud resources used by various groups in your organization. You can specify tags as key-value pairs when you create a cluster, and Azure Databricks applies these tags to cloud resources like VMs and disk volumes, as well as DBU usage reports.
For detailed information about how pool and cluster tag types work together, see Monitor usage using cluster, pool, and workspace tags.
For convenience, Azure Databricks applies four default tags to each cluster: Vendor, Creator, ClusterName, and ClusterId.
These tags propagate to detailed cost analysis reports that you can access in the Azure portal.
Check out an example of how billing works.
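As a sketch, custom tags can be passed as key-value pairs in the cluster spec when creating a cluster through the Clusters API; every value below (workspace URL, names, node type, sizes) is a placeholder, and the default tags listed above are still applied on top:

# Create a cluster with a custom tag; all values are placeholders
curl -s -X POST \
  -H "Authorization: Bearer <databricks-pat>" \
  -H "Content-Type: application/json" \
  -d '{
        "cluster_name": "<process_name>_cluster",
        "spark_version": "11.3.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 2,
        "custom_tags": {"Process": "<process_name>"}
      }' \
  "https://<databricks-instance>/api/2.0/clusters/create"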

How to create/access Hive tables with external Metastore on additional Azure Blob Storage?

I want to perform some data transformation in Hive with Azure Data Factory (v1) running an Azure HDInsight On Demand cluster (3.6).
Since the HDInsight On Demand cluster gets destroyed after some idle time and I want/need to keep the metadata about the Hive tables (e.g. partitions), I also configured an external Hive metastore, using an Azure SQL Server database.
Now I want to store all production data on a separate storage account from the "default" account, where Data Factory and HDInsight also create containers for logging and other runtime data.
So I have the following resources:
Data Factory with HDInsight On Demand (as a linked service)
SQL Server and database for Hive metastore (configured in HDInsight On Demand)
Default storage account to be used by Data Factory and HDInsight On Demand cluster (blob storage, general purpose v1)
Additional storage account for data ingress and Hive tables (blob storage, general purpose v1)
Except the Data Factory, which is in location North Europe, all resources are in the same location West Europe, which should be fine (the HDInsight cluster must be in the same location as any storage accounts you want to use). All Data Factory related deployment is done using the DataFactoryManagementClient API.
An example Hive script (deployed as a HiveActivity in Data Factory) looks like this:
CREATE TABLE IF NOT EXISTS example_table (
  deviceId string,
  createdAt timestamp,
  batteryVoltage double,
  hardwareVersion string,
  softwareVersion string
)
PARTITIONED BY (year string, month string) -- year and month from createdAt
CLUSTERED BY (deviceId) INTO 256 BUCKETS
STORED AS ORC
LOCATION 'wasb://container@additionalstorage.blob.core.windows.net/example_table'
TBLPROPERTIES ('transactional'='true');
INSERT INTO TABLE example_table PARTITION (year, month) VALUES ("device1", timestamp "2018-01-22 08:57:00", 2.7, "hw1.32.2", "sw0.12.3");
Following the documentation here and here, this should be rather straightforward: Simply add the new storage account as an additional linked service (using the additionalLinkedServiceNames property).
However, this resulted in the following exception when a Hive script tried to access a table stored on this account:
IllegalStateException Error getting FileSystem for wasb : org.apache.hadoop.fs.azure.AzureException: org.apache.hadoop.fs.azure.KeyProviderException: ExitCodeException exitCode=2: Error reading S/MIME message
139827842123416:error:0D06B08E:asn1 encoding routines:ASN1_D2I_READ_BIO:not enough data:a_d2i_fp.c:247:
139827842123416:error:0D0D106E:asn1 encoding routines:B64_READ_ASN1:decode error:asn_mime.c:192:
139827842123416:error:0D0D40CB:asn1 encoding routines:SMIME_read_ASN1:asn1 parse error:asn_mime.c:517:
Some googling told me that this happens when the key provider is not configured correctly (i.e. the exception is thrown because it tries to decrypt the key even though it is not encrypted). After manually setting fs.azure.account.keyprovider.<storage_name>.blob.core.windows.net to org.apache.hadoop.fs.azure.SimpleKeyProvider, it seemed to work for reading and "simple" writing of data to tables, but failed again when the metastore got involved (creating a table, adding new partitions, ...):
ERROR exec.DDLTask: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Got exception: org.apache.hadoop.fs.azure.AzureException com.microsoft.azure.storage.StorageException: Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.)
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:783)
at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:4434)
at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:316)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
[...]
at org.apache.hadoop.util.RunJar.main(RunJar.java:148)
Caused by: MetaException(message:Got exception: org.apache.hadoop.fs.azure.AzureException com.microsoft.azure.storage.StorageException: Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$create_table_with_environment_context_result$create_table_with_environment_context_resultStandardScheme.read(ThriftHiveMetastore.java:38593)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$create_table_with_environment_context_result$create_table_with_environment_context_resultStandardScheme.read(ThriftHiveMetastore.java:38561)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$create_table_with_environment_context_result.read(ThriftHiveMetastore.java:38487)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:86)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_create_table_with_environment_context(ThriftHiveMetastore.java:1103)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.create_table_with_environment_context(ThriftHiveMetastore.java:1089)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.create_table_with_environment_context(HiveMetaStoreClient.java:2203)
at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.create_table_with_environment_context(SessionHiveMetaStoreClient.java:99)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:736)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:724)
[...]
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:178)
at com.sun.proxy.$Proxy5.createTable(Unknown Source)
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:777)
... 24 more
I tried googling that again, but had no luck finding anything usable. I think it may have something to do with the fact that the metastore service runs separately from Hive and for some reason does not have access to the configured storage account keys... but to be honest, I think this should all just work without manually tinkering with the Hadoop/Hive configuration.
So, my question is: What am I doing wrong and how is this supposed to work?
You need to make sure you also add the hadoop-azure.jar and the azure-storage-5.4.0.jar to your Hadoop Classpath export in your hadoop-env.sh.
export HADOOP_CLASSPATH=/usr/lib/hadoop-client/hadoop-azure.jar:/usr/lib/hadoop-client/lib/azure-storage-5.4.0.jar:$HADOOP_CLASSPATH
And you will need to add the storage key via the following parameter in your core-site.xml:
fs.azure.account.key.{storageaccount}.blob.core.windows.net
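A quick way to verify the key before persisting it in core-site.xml is to pass it as an inline Hadoop property on a filesystem command (the account and container names are taken from the question above; <storage-key> is a placeholder):

# List the container on the additional account, supplying the key per-command
hadoop fs \
  -Dfs.azure.account.key.additionalstorage.blob.core.windows.net=<storage-key> \
  -ls "wasbs://container@additionalstorage.blob.core.windows.net/"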
When you create your database and table, you need to specify the location using your storage account and the container:
Create table {Tablename}
...
LOCATION 'wasbs://{container}@{storageaccount}.blob.core.windows.net/{filepath}'
If you still have problems after trying the above, check whether the storage account is a V1 or V2. We had an issue where a V2 storage account did not work with our version of HDP.

azure HDInsight script action

I am trying to copy a file from an accessible Data Lake Store to blob storage while spinning up the cluster.
I am using this command from the Azure documentation:
hadoop distcp adl://data_lake_store_account.azuredatalakestore.net:443/myfolder wasb://container_name@storage_account_name.blob.core.windows.net/example/data/gutenberg
Now, if I want to automate this instead of hardcoding it, how do I use this in a script action? To be specific, how can I dynamically get the container name and storage_account_name associated with the cluster while spinning it up?
First, as the documentation says:
A script action is simply a Bash script that you provide a URI to, and parameters for. The script runs on nodes in the HDInsight cluster.
So you just need to refer to the official tutorial Script action development with HDInsight to write your script action and learn how to run it. Or you can call the REST API Run Script Actions on a running cluster (Linux cluster only) to run it automatically.
For how to dynamically get the container name & storage account, a way that works for any language is to call the REST API Get configurations and extract the property you want from the core-site section of the JSON response, or to call the Get configuration REST API with core-site as the {configuration Type} parameter in the URL and extract the property you want from the JSON response.
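Since a script action runs on the cluster nodes after the storage configuration is already in place, an alternative to the REST call is to read the default filesystem straight from the node's Hadoop configuration; a rough sketch, reusing the hard-coded ADL source path from the question:

#!/usr/bin/env bash
# fs.defaultFS resolves to wasb(s)://<container>@<storage_account>.blob.core.windows.net
DEFAULT_FS=$(hdfs getconf -confKey fs.defaultFS)

# Copy from the Data Lake Store into the cluster's default blob container
hadoop distcp \
  "adl://data_lake_store_account.azuredatalakestore.net:443/myfolder" \
  "${DEFAULT_FS}/example/data/gutenberg"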
Hope it helps.
