Submitting a Databricks notebook run specifying a cluster pool?

I'm submitting a one-time, throwaway notebook job with:
azuredatabricks.net/api/2.0/jobs/runs/submit
$json = @"
{
  "run_name": "integration testing notebook task",
  "existing_cluster_id": "$global:clusterID",
  "timeout_seconds": 3600,
  "notebook_task": {
    "notebook_path": "$global:notebookPath"
  }
}
"@
However, rather than specify an existing cluster ID (which I had to create myself initially), I want it to use a cluster from an existing pool. How is this possible? The schema doesn't seem to accept instance_pool_id for this request.

You need to submit the request with new_cluster instead, and inside its definition specify the instance_pool_id, the same way as for normal clusters. Something like this:
$json = #"
{
"run_name": "integration testing notebook task",
"new_cluster": : {
"spark_version": "7.3.x-scala2.12",
"node_type_id": "r3.xlarge",
"aws_attributes": {
"availability": "ON_DEMAND"
},
"num_workers": 10,
"instance_pool_id": "$global:poolID"
},
"timeout_seconds": 3600,
"notebook_task": {
"notebook_path": "$global:notebookPath"
}
}
"#
But this will create a new cluster using machines from the pool, not attach to a cluster that is already allocated there.
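
If you are calling the REST API directly rather than through a PowerShell here-string, here is a minimal sketch in Python using the requests library. The workspace URL, token, pool ID, and notebook path are placeholders, and I believe node_type_id can be omitted when an instance_pool_id is given, since the pool already determines the node type:

import requests

# All of these values are placeholders to substitute with your own.
host = "https://<your-workspace>.azuredatabricks.net"
token = "<personal-access-token>"

payload = {
    "run_name": "integration testing notebook task",
    "new_cluster": {
        "spark_version": "7.3.x-scala2.12",
        "num_workers": 10,
        # Draw the workers from the pool instead of provisioning standalone VMs.
        "instance_pool_id": "<your-pool-id>",
    },
    "timeout_seconds": 3600,
    "notebook_task": {"notebook_path": "/Shared/integration-test"},
}

resp = requests.post(
    f"{host}/api/2.0/jobs/runs/submit",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
resp.raise_for_status()
print(resp.json())  # contains the run_id of the one-time run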

Related

How to provide cluster name in Azure Databricks Notebook Run Now JSON

I am able to use the below JSON through POSTMAN to run my Databricks notebook.
I want to be able to give a name to the cluster that is created through the "new_cluster" options.
Is there any such option available?
{
  "tasks": [
    {
      "task_key": "Job_Run_Api",
      "description": "To see how the run and trigger api works",
      "new_cluster": {
        "spark_version": "9.0.x-scala2.12",
        "node_type_id": "Standard_E8as_v4",
        "num_workers": "1",
        "custom_tags": {
          "Workload": "Job Run Api"
        }
      },
      "libraries": [
        {
          "maven": {
            "coordinates": "net.sourceforge.jtds:jtds:1.3.1"
          }
        }
      ],
      "notebook_task": {
        "notebook_path": "/Shared/POC/Job_Run_Api_POC",
        "base_parameters": {
          "name": "Junaid Khan"
        }
      },
      "timeout_seconds": 2100,
      "max_retries": 0
    }
  ],
  "job_clusters": null,
  "run_name": "RUN_API_TEST",
  "timeout_seconds": 2100
}
When the above API call is done, the cluster created has a name like "job-5975-run-2" and that is not super explanatory.
I have tried to use a "cluster_name" key inside the "new_cluster" block, but I got an error saying that I can't do that:
{
  "error_code": "INVALID_PARAMETER_VALUE",
  "message": "Cluster name should not be provided for jobs."
}
Appreciate any help here
Cluster names for jobs are automatically generated and can't be changed. If you want to somehow track specific jobs, use tags.
P.S. If you want more "advanced" tracking capability, look at the Overwatch project.
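
For example, a minimal sketch of a new_cluster definition that relies on custom_tags for tracking instead of a name; the "Team" tag below is purely illustrative:

# Sketch of a new_cluster block that uses tags for tracking instead of a name.
new_cluster = {
    "spark_version": "9.0.x-scala2.12",
    "node_type_id": "Standard_E8as_v4",
    "num_workers": 1,
    # Tags are propagated to the underlying VMs and appear in usage reports,
    # so they can identify which workload created the job cluster.
    "custom_tags": {
        "Workload": "Job Run Api",
        "Team": "integration-testing",  # illustrative tag, not required
    },
}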

Attach AWS EMR cluster to remote Jupyter notebook using sparkmagic

I am trying to connect and attach an AWS EMR cluster (emr-5.29.0) to a Jupyter notebook that I am working on from my local Windows machine. I have started a cluster with Hive 2.3.6, Pig 0.17.0, Hue 4.4.0, Livy 0.6.0, Spark 2.4.4, and the subnets are public. I found that this can be done with Azure HDInsight, so I was hoping something similar can be done using EMR. The issue I am having is with passing the correct values in the config.json file. How should I attach an EMR cluster?
I could work on the EMR notebooks native to AWS, but thought I would go the develop-locally route and have hit a roadblock.
{
  "kernel_python_credentials" : {
    "username": "{IAM ACCESS KEY ID}", # not sure about the username for the cluster
    "password": "{IAM SECRET ACCESS KEY}", # I use putty to ssh into the cluster with the pem key, so again not sure about the password for the cluster
    "url": "ec2-xx-xxx-x-xxx.us-west-2.compute.amazonaws.com", # as per the AWS blog, when Amazon EMR is launched with Livy installed, the EMR master node becomes the endpoint for Livy
    "auth": "None"
  },
  "kernel_scala_credentials" : {
    "username": "{IAM ACCESS KEY ID}",
    "password": "{IAM SECRET ACCESS KEY}",
    "url": "{Master public DNS}",
    "auth": "None"
  },
  "kernel_r_credentials": {
    "username": "{}",
    "password": "{}",
    "url": "{}"
  },
Update 1/4/2021
On 4/1, I got sparkmagic to work on my local Jupyter notebook. Used these documents as references (ref-1, ref-2 & ref-3) to set up local port forwarding (if possible, avoid using sudo).
sudo ssh -i ~/aws-key/my-pem-file.pem -N -L 8998:ec2-xx-xxx-xxx-xxx.us-west-2.compute.amazonaws.com:8998 hadoop@ec2-xx-xxx-xxx-xxx.us-west-2.compute.amazonaws.com
Configuration details
Release label: emr-5.32.0
Hadoop distribution: Amazon 2.10.1
Applications: Hive 2.3.7, Livy 0.7.0, JupyterHub 1.1.0, Spark 2.4.7, Zeppelin 0.8.2
Updated config file
{
  "kernel_python_credentials" : {
    "username": "",
    "password": "",
    "url": "http://localhost:8998"
  },
  "kernel_scala_credentials" : {
    "username": "",
    "password": "",
    "url": "http://localhost:8998",
    "auth": "None"
  },
  "kernel_r_credentials": {
    "username": "",
    "password": "",
    "url": "http://localhost:8998"
  },
  "logging_config": {
    "version": 1,
    "formatters": {
      "magicsFormatter": {
        "format": "%(asctime)s\t%(levelname)s\t%(message)s",
        "datefmt": ""
      }
    },
    "handlers": {
      "magicsHandler": {
        "class": "hdijupyterutils.filehandler.MagicsFileHandler",
        "formatter": "magicsFormatter",
        "home_path": "~/.sparkmagic"
      }
    },
    "loggers": {
      "magicsLogger": {
        "handlers": ["magicsHandler"],
        "level": "DEBUG",
        "propagate": 0
      }
    }
  },
  "authenticators": {
    "Kerberos": "sparkmagic.auth.kerberos.Kerberos",
    "None": "sparkmagic.auth.customauth.Authenticator",
    "Basic_Access": "sparkmagic.auth.basic.Basic"
  },
  "wait_for_idle_timeout_seconds": 15,
  "livy_session_startup_timeout_seconds": 60,
  "fatal_error_suggestion": "The code failed because of a fatal error:\n\t{}.\n\nSome things to try:\na) Make sure Spark has enough available resources for Jupyter to create a Spark context.\nb) Contact your Jupyter administrator to make sure the Spark magics library is configured correctly.\nc) Restart the kernel.",
  "ignore_ssl_errors": false,
  "session_configs": {
    "driverMemory": "1000M",
    "executorCores": 2
  },
  "use_auto_viz": true,
  "coerce_dataframe": true,
  "max_results_sql": 2500,
  "pyspark_dataframe_encoding": "utf-8",
  "heartbeat_refresh_seconds": 5,
  "livy_server_heartbeat_timeout_seconds": 60,
  "heartbeat_retry_seconds": 1,
  "server_extension_default_kernel_name": "pysparkkernel",
  "custom_headers": {},
  "retry_policy": "configurable",
  "retry_seconds_to_sleep_list": [0.2, 0.5, 1, 3, 5],
  "configurable_retry_policy_max_retries": 8
}
Second update 1/9
Back to square one. Keep getting this error and have spent days trying to debug. Not sure what I did previously to get things going. Also checked my security group config and it looks fine, SSH on port 22.
An error was encountered:
Error sending http request and maximum retry encountered.
Created local port forwarding (SSH tunneling) to the Livy server on port 8998 and it works like magic.
sudo ssh -i ~/aws-key/my-pem-file.pem -N -L 8998:ec2-xx-xxx-xxx-xxx.us-west-2.compute.amazonaws.com:8998 hadoop@ec2-xx-xxx-xxx-xxx.us-west-2.compute.amazonaws.com
Did not change my config.json file from the 1/4 update.
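
As a quick sanity check that the tunnel is actually up before starting a sparkmagic kernel, a small sketch that hits Livy's sessions endpoint through the forwarded port (assuming the tunnel above is running and Livy listens on its default port 8998):

import requests

# Assumes the SSH tunnel above is forwarding localhost:8998 to the
# Livy server on the EMR master node.
resp = requests.get("http://localhost:8998/sessions", timeout=10)
resp.raise_for_status()
print(resp.json())  # e.g. {"from": 0, "total": 0, "sessions": []}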

IBM Analytics Engine - Cluster creation fails if I pass Ambari configuration as part of advanced options

I am using Analytics Engine on IBM Cloud and trying to pass Ambari configuration like below in the advanced provisioning options.
{
  "ambari_config": {
    "hardware_config": "default",
    "software_package": "ae-1.2-hive-spark",
    "num_compute_nodes": 1,
    "advanced_options": {
      "ambari_config": {
        "spark2-defaults": {
          "spark.dynamicAllocation.minExecutors": 1,
          "spark.shuffle.service.enabled": true,
          "spark.dynamicAllocation.maxExecutors": 2,
          "spark.dynamicAllocation.enabled": true
        }
      }
    }
  }
}
I am following this documentation to pass the above configuration
https://cloud.ibm.com/docs/services/AnalyticsEngine?topic=AnalyticsEngine-advanced-provisioning-options
After multiple retries I see that each time my cluster request fails.
After reviewing my request, I figured out that I was passing the ambari_config attribute twice in my request, which is not accepted.
The valid JSON which worked for me looks like this:
{
  "hardware_config": "default",
  "software_package": "ae-1.2-hive-spark",
  "num_compute_nodes": 1,
  "advanced_options": {
    "ambari_config": {
      "spark2-defaults": {
        "spark.dynamicAllocation.minExecutors": 1,
        "spark.shuffle.service.enabled": true,
        "spark.dynamicAllocation.maxExecutors": 2,
        "spark.dynamicAllocation.enabled": true
      }
    }
  }
}
One more scenario where cluster creation can fail is an error like InvalidTopologyException: The following config types are not defined in the stack: [spar2-hive-site-override]
The above issue was caused by a typo in the name of the config property file where the user wants to add or modify properties.
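
As a small guard against the duplicated-attribute mistake described above, here is a sketch that builds the payload in Python and checks its shape before it is sent; it makes no assumptions about the provisioning endpoint itself:

import json

payload = {
    "hardware_config": "default",
    "software_package": "ae-1.2-hive-spark",
    "num_compute_nodes": 1,
    "advanced_options": {
        "ambari_config": {
            "spark2-defaults": {
                "spark.dynamicAllocation.minExecutors": 1,
                "spark.shuffle.service.enabled": True,
                "spark.dynamicAllocation.maxExecutors": 2,
                "spark.dynamicAllocation.enabled": True,
            }
        }
    },
}

# ambari_config belongs only under advanced_options, not at the top level.
assert "ambari_config" not in payload, "ambari_config must not be a top-level key"
print(json.dumps(payload, indent=2))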

Webhook is created, but callback is never hit

I'm trying to remove the polling and integrate webhooks into the file-conversion process. The problem is that the webhook is created, but the callback is never called.
I'm following the instructions from here: https://forge.autodesk.com/en/docs/webhooks/v1/tutorials/create-a-hook-model-derivative/
The web server is exposed by the following command: ngrok http -host-header=rewrite https://localhost:44366
The callback is http://f36a47b8.ngrok.io/derivative and is up and running. POST requests from Postman (internal network) and POST requests from external networks (cellular data) reach the endpoint and are successfully redirected.
A hook is created:
"hookId": "51897b50-522a-11ea-b885-f34f23e3435e",
"tenant": "c0761189-32dd-4ca3-9e52-3ae400f91651",
"callbackUrl": "http://f36a47b8.ngrok.io/derivative",
"createdBy": "HUpqLPysSUmbFGlhQo0uG8XMqimfQnRG",
"event": "extraction.updated",
"createdDate": "2020-02-18T08:40:29.829+0000",
"system": "derivative",
"creatorType": "Application",
"status": "active",
"scope": {
"workflow": "c0761189-32dd-4ca3-9e52-3ae400f91651"
},
"urn": "urn:adsk.webhooks:events.hook:51897b50-522a-11ea-b885-f34f23e3435e",
"__self__": "/systems/derivative/events/extraction.updated/hooks/51897b50-522a-11ea-b885-f34f23e3435e"
}
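
For reference, registering such a hook can be sketched in Python as below; the access token is a placeholder, the callback URL and workflow id are the values shown above, and the endpoint path mirrors the hook's __self__ value:

import requests

access_token = "<two-legged-oauth-token>"  # placeholder

hook = {
    "callbackUrl": "http://f36a47b8.ngrok.io/derivative",
    "scope": {"workflow": "c0761189-32dd-4ca3-9e52-3ae400f91651"},
}

# System "derivative", event "extraction.updated", matching the hook above.
resp = requests.post(
    "https://developer.api.autodesk.com/webhooks/v1/systems/derivative/events/extraction.updated/hooks",
    headers={"Authorization": f"Bearer {access_token}"},
    json=hook,
)
resp.raise_for_status()  # expect 201 Created when the hook is registered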
Then a call to modelderivative/v2/designdata/job is issued with the following content:
var job = new JobRequest
{
    Input = new Input
    {
        Urn = urnBase64,
    },
    Output = new Output
    {
        Formats = new List<Format>
        {
            new Format
            {
                Type = "svf",
                Views = new List<string> { "2d", "3d" }
            }
        },
        Destination = new Destination { Region = "EMEA" }
    },
    Misc = new Misc
    {
        Workflow = workflowId
    }
};
The response is a success with an urn (like before).
And from that point nothing comes from the webhook. The callback is never reached, even though after some time the file is converted and can be loaded in the viewer as before.
I've viewed those topics:
Unable to receive Forge webhooks, or unable to get them to fire
Why is webhook workflow not taken into consideration when creating modelderivative job?
but they didn't help.
What am I missing?
It turns out that there is a problem with jobs for the Derivative API in the 'EMEA' region, where no callbacks are fired when a job finishes. Changing the region to 'us' fixes the issue and the callback is hit when a job event occurs.
From the documentation example, change the region parameter:
curl -X 'POST' \
  -H 'Content-Type: application/json; charset=utf-8' \
  -H 'Authorization: Bearer PtnrvrtSRpWwUi3407QhgvqdUVKL' \
  -H 'x-ads-force: false' -v 'https://developer.api.autodesk.com/modelderivative/v2/designdata/job' \
  -d '{
    "input": {
      "urn": "dXJuOmFkc2sub2JqZWN0czpvcy5vYmplY3Q6bW9kZWxkZXJpdmF0aXZlL0E1LnppcA",
      "compressedUrn": true,
      "rootFilename": "A5.iam"
    },
    "output": {
      "destination": {
        "region": "us" <- Change the region from 'EMEA' to 'us'
      },
      "formats": [
        {
          "type": "svf",
          "views": [
            "2d",
            "3d"
          ]
        }
      ]
    }
  }'
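
The same job request expressed in Python, as a sketch; the bearer token and URN are the placeholder values from the curl example above:

import requests

access_token = "<bearer-token>"  # placeholder from the example above

job = {
    "input": {
        "urn": "dXJuOmFkc2sub2JqZWN0czpvcy5vYmplY3Q6bW9kZWxkZXJpdmF0aXZlL0E1LnppcA",
        "compressedUrn": True,
        "rootFilename": "A5.iam",
    },
    "output": {
        # Region "us" instead of "EMEA" so the extraction.updated callback fires.
        "destination": {"region": "us"},
        "formats": [{"type": "svf", "views": ["2d", "3d"]}],
    },
}

resp = requests.post(
    "https://developer.api.autodesk.com/modelderivative/v2/designdata/job",
    headers={
        "Content-Type": "application/json; charset=utf-8",
        "Authorization": f"Bearer {access_token}",
        "x-ads-force": "false",
    },
    json=job,
)
resp.raise_for_status()
print(resp.json())  # returns the accepted job with its urn, like before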

Unable to use Resource Functions for Azure Resource Manager Templates

My parameters file looks as follows:
{
  "$schema": "http://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "siteName": {
      "value": "my-api-application"
    },
    "appServicePlanName": {
      "value": "MyServicePlan"
    },
    "siteLocation": {
      "value": "West US"
    },
    "vaultResourceGroup": {
      "value": "my-vault-res-group"
    },
    "vaultName": {
      "value": "my-keyvault"
    },
    "nodeEnv": {
      "value": "development"
    },
    "adminPassword": {
      "reference": {
        "keyVault": {
          "id": "/subscriptions/yyyyyyyy-xxxx-xxxx-xxxx-yyyyyyyy/resourceGroups/my-vault-res-group/providers/Microsoft.KeyVault/vaults/my-keyvault"
        },
        "secretName": "adminPassword"
      }
    }
  }
}
The adminPassword value will be picked up from the specified KeyVault, with the particular id. However, I have to hard code the "id" value.
According to this link, I could specify the id using something like this:
[resourceId(subscription().subscriptionId, parameters('vaultResourceGroup'), 'Microsoft.KeyVault/vaults', parameters('vaultName'))]
However, when using the above syntax/resource functions, I receive an error while releasing and deploying my App Service using VSTS (I used the Azure Resource Group Deployment task for app deployment). The error is somewhat like this:
The id must be of the following format:
/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/{resourceProviderNamespace}/{resourceType}/{resourceName}
Not sure what I am doing wrong?
You're not doing anything wrong, that's intentional. You must use a literal resourceId in the parameters file (parameters files don't allow for function use).
If you have a scenario for a dynamic KeyVault id you can use a nested deployment:
https://learn.microsoft.com/en-us/azure/azure-resource-manager/resource-manager-keyvault-parameter#reference-a-secret-with-dynamic-id
