Azure Databricks Execution Fails - CLOUD_PROVIDER_LAUNCH_FAILURE

I'm using Azure Data Factory (ADF) for my data ingestion and running an Azure Databricks notebook through ADF's Notebook activity.
The notebook uses an existing instance pool of Standard_DS3_v2 (2-5 nodes, autoscaled) with the 7.3 LTS Spark runtime. The same Azure subscription is used by multiple teams for their respective data pipelines.
During the ADF pipeline execution, the notebook activity frequently fails with the error message below:
{
  "reason": {
    "code": "CLOUD_PROVIDER_LAUNCH_FAILURE",
    "type": "CLOUD_FAILURE",
    "parameters": {
      "azure_error_code": "SubnetIsFull",
      "azure_error_message": "Subnet /subscriptions/<Subscription>/resourceGroups/<RG>/providers/Microsoft.Network/virtualNetworks/<VN>/subnets/<subnet> with address prefix 10.237.35.128/26 does not have enough capacity for 2 IP addresses."
    }
  }
}
Can anyone explain what this error is and how I can reduce its occurrence? (The documentation I found isn't very explanatory.)

Looks like your Databricks workspace has been created within a VNet (VNet injection). When this is done, the Databricks instances are created within one of the subnets of this VNet. It seems that at the point of triggering, all the IPs within the subnet were already utilized.
You cannot and should not extend the IP space. Please do not attempt to change the existing VNet configuration, as this will affect your Databricks cluster.
You have the following options:
Check when fewer Databricks instances are being instantiated and schedule your ADF pipeline during that window. The goal is to distribute executions across time so you don't exceed the IPs available in the subnet.
Request your IT department to create a new VNet and subnet, and create a new Databricks cluster in that VNet.

The problem arises from the fact that when your workspace was created, the network and subnet sizes weren't planned correctly (see docs). As a result, when you try to launch a cluster, there aren't enough IP addresses in the given subnet, and you get this error.
Unfortunately it's not currently possible to expand the network/subnet size, so if you need a bigger network you need to deploy a new workspace and migrate into it.
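To see how tight that limit is, you can do the arithmetic directly from the prefix in the error message. Below is a minimal sketch using only the Python standard library; the figure of 5 Azure-reserved addresses per subnet is standard, while the assumption of one IP per cluster node per Databricks subnet reflects how VNet injection typically works and is not taken from this thread.

import ipaddress

# Prefix copied from the error message above.
subnet = ipaddress.ip_network("10.237.35.128/26")

total_ips = subnet.num_addresses          # 64 addresses in a /26
azure_reserved = 5                        # network, gateway, 2x DNS, broadcast
usable_ips = total_ips - azure_reserved   # 59 usable addresses

# Assumption: a VNet-injected Databricks workspace consumes one IP per
# cluster node in each of its two subnets (host/public and container/private).
print(f"{subnet}: {usable_ips} usable IPs")
print(f"At most {usable_ips} Databricks nodes can run concurrently "
      f"across all clusters sharing this subnet.")

So a /26 caps all teams sharing that subnet at roughly 59 concurrent nodes, which is why staggering the ADF schedules or redeploying into a workspace with larger subnets are the only realistic options.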

Related

How to rename Databricks job cluster name during runtime

I have created an ADF pipeline with a Notebook activity. This notebook activity automatically creates Databricks job clusters with autogenerated job cluster names.
1. Rename the job cluster during runtime from ADF
I'm trying to rename this job cluster with the process name (or another name) during runtime from ADF / the ADF linked service.
Instead of job-59, I want it to be replaced with <process_name>_
2. Rename the ClusterName tag
I want to replace the default generated ClusterName tag with the required process name.
Settings for the job can be updated using the Reset or Update endpoints.
Cluster tags allow you to easily monitor the cost of cloud resources used by various groups in your organization. You can specify tags as key-value pairs when you create a cluster, and Azure Databricks applies these tags to cloud resources like VMs and disk volumes, as well as DBU usage reports.
For detailed information about how pool and cluster tag types work together, see Monitor usage using cluster, pool, and workspace tags.
For convenience, Azure Databricks applies four default tags to each cluster: Vendor, Creator, ClusterName, and ClusterId.
These tags propagate to detailed cost analysis reports that you can access in the Azure portal.
Check out an example of how billing works.
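As a rough illustration of the Reset/Update route, the sketch below attaches a custom process-name tag to a job's cluster through the Jobs Update endpoint using the requests library. The host, token, job ID, and tag value are placeholders, and the exact shape of new_settings depends on how the job was defined (legacy top-level new_cluster versus tasks/job_clusters), so treat it as a starting point rather than a drop-in solution.

import requests

# Placeholder values - substitute your workspace URL, token and job ID.
DATABRICKS_HOST = "https://adb-1234567890123456.7.azuredatabricks.net"
TOKEN = "<personal-access-token>"
JOB_ID = 59

# Partial update of the job settings: attach a custom tag to the job cluster.
# (jobs/reset would replace *all* settings; jobs/update patches them.)
payload = {
    "job_id": JOB_ID,
    "new_settings": {
        "new_cluster": {
            "custom_tags": {"process_name": "<process_name>"}
        }
    },
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/update",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
    timeout=30,
)
resp.raise_for_status()

The four default tags (including ClusterName) are applied by Azure Databricks itself, so adding your own custom tag is usually the practical way to get a recognizable process name into the cost reports.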

Storage account connectivity method for AKS

I'm setting up a storage account so I can dynamically create and use a persistent volume with Azure Files in Azure Kubernetes Service (AKS). I'm doing this to have:
Have a PV and PVC for the database
A place to store the application files
AKS does create a storage account in the automatically created MC_<resource-group>_<aks-name>_<region> resource group. However, that storage account is destroyed if the node size/VM is changed (not the node count), so it shouldn't be used, since you'll lose your files and database if you need a node size/VM with more resources.
Neither this documentation nor any other I've come across says what the best practice is for the connectivity method:
Public endpoint (all networks)
Public endpoint (selected networks)
Private endpoint
The first option sounds like a bad idea.
The second option allows me to select a virtual network, and there are two choices:
MC_<resource-group>_<aks-name>_<region>... again, this doesn't seem like a good idea, because if the node size/VM is changed, the connection will be broken.
aks-vnet-<number>... I'm not sure what this is, but it looks like it is part of the previous resource group, so it will also be destroyed in the previously mentioned scenario.
The third option contains a number of choices, some of which are included in the second option.
So how should I securely set this up for AKS to share files with the application and persist database files?
EDIT
Looking at both the "Firewalls and virtual networks" and "Private endpoint connections" settings for the storage account that comes with the AKS node, it looks like it is just set up for "All networks"... so maybe having that be where my actual PV and PVC are stored isn't such an issue...? Could use some clarity on the topic.
I'm not sure where the problem lies. All the assets generated by AKS are tied to the AKS lifecycle: if you delete AKS, it will delete the MC_* resource group (and that is 100% right). I'm not sure what you mean about the storage account being destroyed; it wouldn't get destroyed unless you remove the PVC and the reclaim policy deletes the underlying storage.
Reading: https://learn.microsoft.com/en-us/azure/aks/azure-files-dynamic-pv
As for the networking part, selected networks, selecting the AKS nodes' network, should be the way to go. You can figure out that network by looking at the AKS nodes or the AKS agent pool definition(s). I don't think this is configurable using only Kubernetes primitives, so it would be a manual/scripted action after the storage account is created.
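If you prefer to script the "selected networks" setup instead of clicking through the portal, a sketch with the azure-identity and azure-mgmt-storage packages could look roughly like the following. All names and IDs are placeholders, and the AKS node subnet must have the Microsoft.Storage service endpoint enabled for the rule to take effect.

from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import (
    NetworkRuleSet,
    StorageAccountUpdateParameters,
    VirtualNetworkRule,
)

# Placeholder identifiers - adjust to your environment.
SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "<storage-account-resource-group>"
ACCOUNT_NAME = "<storage-account-name>"
# Subnet the AKS nodes live in (visible on the agent pool / node resources).
AKS_SUBNET_ID = (
    "/subscriptions/<subscription-id>/resourceGroups/<vnet-rg>"
    "/providers/Microsoft.Network/virtualNetworks/aks-vnet-<number>"
    "/subnets/<node-subnet>"
)

client = StorageManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Deny public traffic by default and allow only the AKS node subnet
# (plus trusted Azure services).
client.storage_accounts.update(
    RESOURCE_GROUP,
    ACCOUNT_NAME,
    StorageAccountUpdateParameters(
        network_rule_set=NetworkRuleSet(
            default_action="Deny",
            bypass="AzureServices",
            virtual_network_rules=[
                VirtualNetworkRule(virtual_network_resource_id=AKS_SUBNET_ID)
            ],
        )
    ),
)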

Fail to create on-demand Hadoop cluster in Azure Data Factory; additionalProperties is not active

It's my first time trying out Azure Data Factory, so I hope this is not a bad question to ask.
I'm using the Azure portal and trying to create an on-demand Hadoop cluster as one of the linked services in Azure Data Factory, following the steps in the tutorial.
But whenever I click create, the following error message pops up:
Failed to save HDinisghtLinkedService. Error: An additional property 'subnetName' has been specified but additionalProperties is not active.The relevant property is 'HDInsightOnDemandLinkedServiceTypeProperties'.The error occurred at the location 'body/properties/typeProperties' in the request.;An additional property 'virtualNetworkId' has been specified but additionalProperties is not active.The relevant property is 'HDInsightOnDemandLinkedServiceTypeProperties'.The error occurred at the location 'body/properties/typeProperties' in the request.
I couldn't understand why it requires 'subnetName' and 'virtualNetworkId'. But I tried putting values under Advanced Properties -> Choose VNet and Subnet -> From Azure subscription -> and put in the existing virtual network ID and subnet name. The problem is still present and the same error message shows up.
Other background information:
For the tutorial I posted above, I did not use its PowerShell code. I have an existing resource group and created a new storage account in the Azure portal.
I also created a new app registration in Azure Active Directory and retrieved the service principal application ID and authentication key following this link.
Some parameters:
Type: On-demand HDInsight
Azure Storage Linked Service: the one listed in the connection
Cluster size: 1 (for testing)
Service principal id/service principal key: described above
Version: 3.6
...
Any thoughts or anything I might be doing wrong?
From the error message, it clearly states that "subnetName" is not active, which means it has not been created at all.
Note: If you want to create the on-demand cluster within your VNet, first create the VNet and subnet and then pass those values.
Advanced properties are not mandatory to create an on-demand cluster.
Have you tried creating the on-demand cluster without passing the VNet and subnet?
Hope this helps. Do let us know if you have any further queries.
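For comparison, a minimal on-demand HDInsight linked service definition without any VNet settings looks roughly like the sketch below; the property names follow the ADF JSON schema as I understand it, and every value is a placeholder. If you do want the cluster inside a VNet, virtualNetworkId and subnetName must both be supplied and must point to a VNet and subnet that already exist.

import json

# Sketch of the linked service payload ADF sends; placeholders throughout.
linked_service = {
    "name": "HDInsightOnDemandLinkedService",
    "properties": {
        "type": "HDInsightOnDemand",
        "typeProperties": {
            "clusterType": "hadoop",
            "clusterSize": 1,
            "timeToLive": "00:15:00",
            "version": "3.6",
            "hostSubscriptionId": "<subscription-id>",
            "clusterResourceGroup": "<resource-group>",
            "tenant": "<tenant-id>",
            "servicePrincipalId": "<application-id>",
            "servicePrincipalKey": {
                "type": "SecureString",
                "value": "<authentication-key>",
            },
            "linkedServiceName": {
                "referenceName": "AzureStorageLinkedService",
                "type": "LinkedServiceReference",
            },
            # Only add these together, and only if the VNet/subnet already exist:
            # "virtualNetworkId": "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Network/virtualNetworks/<vnet>",
            # "subnetName": "<subnet>",
        },
    },
}

print(json.dumps(linked_service, indent=2))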

Is it possible to change subnet in Azure AKS deployment?

I'd like to move an instance of Azure Kubernetes Service to another subnet in the same virtual network. Is it possible or the only way to do this is to recreate the AKS instance?
No, it is not possible; you need to redeploy AKS.
Edit (08.02.2023): it's actually possible to some extent now: https://learn.microsoft.com/en-us/azure/aks/configure-azure-cni-dynamic-ip-allocation#configure-networking-with-dynamic-allocation-of-ips-and-enhanced-subnet-support---azure-cli
I'm not sure it can be updated on an existing cluster without recreating it (or the node pool).
I know it's an old thread, but I'm responding in case someone might find it useful. You cannot change the subnet of the AKS cluster directly. However, you can always change the subnets of the underlying components. In our case, we had a simple setup of 2 nodes and a LoadBalancer. We created a new subnet and changed the subnets on these individual components. It worked for us, but do check the services and the pods to ensure everything keeps working correctly.
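For reference, the dynamic IP allocation approach from that link boils down to creating the cluster (or a new node pool) with separate node and pod subnets. Below is a rough sketch driving the Azure CLI from Python; every name and ID is a placeholder, and the flags should be double-checked against the linked document for your CLI version.

import subprocess

# Placeholder subnet IDs - substitute your own VNet and subnets.
NODE_SUBNET_ID = (
    "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Network"
    "/virtualNetworks/<vnet>/subnets/<node-subnet>"
)
POD_SUBNET_ID = (
    "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Network"
    "/virtualNetworks/<vnet>/subnets/<pod-subnet>"
)

# Create an AKS cluster with Azure CNI dynamic IP allocation:
# nodes and pods draw their addresses from different subnets.
subprocess.run(
    [
        "az", "aks", "create",
        "--resource-group", "<rg>",
        "--name", "<cluster-name>",
        "--network-plugin", "azure",
        "--vnet-subnet-id", NODE_SUBNET_ID,
        "--pod-subnet-id", POD_SUBNET_ID,
    ],
    check=True,
)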

Using Packer to Spin a VM and extract the image in an availability set

We have a corporate requirement (due to pricing and whitelisting) to use availability sets in our Azure subscription, and resources like compute should be spun up inside that particular availability set. Since Packer, while creating the image, spins up a temporary VM inside a temporary resource group, I am confused (since I did not find any documentation around it) about whether we can configure Packer to spin up the temporary VM inside the whitelisted availability set.
One possible way I can think of is to spin up the VM in the resource group which we created for the availability set (since everything in Azure needs to be inside a resource group); that way I'm guessing it will be tracked as part of billing, but I am still not sure if the temporary VM will be part of the availability set.
Please help and suggest if there is an alternative way to achieve the same.
