In the 'normal' Spark setup, a Worker can have many Executors.
Why does Azure Databricks use 1 Worker = 1 Executor?
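For context, a minimal sketch of how a standard (standalone) Spark deployment ends up with several executors on one worker; the master URL, resource sizes, class and jar name below are placeholders:
# In standalone mode a worker hosts as many executors as fit into its resources.
# A worker offering 8 cores / 16g can host up to 4 of these 2-core / 3g executors.
./bin/spark-submit \
  --master spark://master-host:7077 \
  --executor-cores 2 \
  --executor-memory 3g \
  --total-executor-cores 8 \
  --class com.example.Main \
  app.jar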
I configured a system and a user pool on my Azure AKS instances, following this guide:
Microsoft Guide
Before this change we only had system-type pools, hosting application pods as well as system pods.
I did the following steps:
creation of a system-type pool with the taint "CriticalAddonsOnly=true:NoSchedule" (to prevent application microservices from being scheduled on the system pool)
conversion of the old pools from system to user
restart the following deployments:
gatekeeper-system:
gatekeeper-audit
gatekeeper-controller
kube-system:
coredns
coredns-autoscaler
metrics-server
azure-policy
azure-policy-webhook
konnectivity-agent
ama-logs-rs
to allow the scheduling of system pods on the system pool as well, since they are not automatically rescheduled after pool creation (a sketch of the equivalent commands is below).
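A hedged sketch of the commands behind these steps (resource group, cluster and pool names are placeholders, not the originals):
# 1. Create a system pool tainted so that application pods are not scheduled on it
az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name systempool \
  --mode System \
  --node-count 2 \
  --node-taints CriticalAddonsOnly=true:NoSchedule
# 2. Convert an old pool from System to User mode (repeat per pool)
az aks nodepool update \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name oldpool1 \
  --mode User
# 3. Restart the system deployments so they get rescheduled
kubectl rollout restart deployment coredns coredns-autoscaler metrics-server -n kube-system
kubectl rollout restart deployment gatekeeper-audit gatekeeper-controller -n gatekeeper-system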
Now I'm noticing that the system pods have been scheduled on the system pool, but I keep seeing the same pods on all the other nodes too. Even if I forcibly delete them from the user pools, they are immediately redeployed there. Is this behaviour correct? Logically, if I have a system pool, shouldn't all system pods run only on that pool and none on the user pools?
Thanks
As per the Microsoft official documentation, these are some features of user node pools and system node pools.
System Node Pool:
Must be running Linux.
They can have a minimum of 1 node, but it is recommended to have 2 nodes, or 3 if it is your only Linux node pool.
They only support AKS clusters running on Virtual Machine Scale Sets.
The nodes need at least 2 vCPUs and 4 GB of memory.
They need to support at least 30 pods.
Cannot be made up of Spot VMs.
Can have multiple system node pools.
If there is only one system node pool, it cannot be deleted.
Can be changed to a user node pool if you have another system node pool.
User Node Pool:
User node pools can be either Linux or Windows.
Can scale down to 0 nodes.
Can be deleted with no issues.
Spot VMs can be used.
Can be changed to a system node pool.
Can have as many user node pools as Azure will allow.
As per the pod definitions, system pods are bound to the system node pool unless they are controlled by a DaemonSet. If a system pod is controlled by a DaemonSet, it is scheduled on every node in the cluster regardless of pool type. My cluster has 4 nodes: 2 system, 2 user. So these DaemonSet-controlled system pods in the kube-system namespace have one replica per node.
kubectl get ds -n kube-system
NAME                       DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
ama-logs                   4         4         4       4            4           <none>          14d
azure-cni-networkmonitor   4         4         4       4            4           <none>          540d
azure-ip-masq-agent        4         4         4       4            4           <none>          540d
kube-proxy                 4         4         4       4            4           <none>          540d
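To confirm that these DaemonSet replicas really land on every node (user pool nodes included), the NODE column of the pod listing can be checked:
kubectl get pods -n kube-system -o wide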
To further control the behaviour so that application pods are not scheduled on the system pool, you can add a taint to the system node pool like this; all application pods will then be scheduled only on the user node pools.
az aks nodepool add \
--resource-group myResourceGroup \
--cluster-name myAKSCluster \
--name systempool \
--node-count 3 \
--node-taints CriticalAddonsOnly=true:NoSchedule \
--mode System
AKS prefers the system node pool when scheduling system pods, but it is not guaranteed that system pods won't be placed on a user node pool when the system node pool does not have enough capacity to schedule all of them.
Have you checked whether your system pool has enough capacity for all the system pods (a quick check is sketched below)?
See the limitations section of the page you mentioned.
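A hedged sketch of such a capacity check (replace <system-node-name> with one of your system pool nodes):
# Identify the system pool nodes via their agentpool label
kubectl get nodes --show-labels | grep agentpool
# Compare allocatable vs. requested resources on a system node
kubectl describe node <system-node-name> | grep -A 8 "Allocated resources"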
I was creating a Databricks job with the API. I just want to know whether the instance pool ID and the driver instance pool ID can be the same.
Yes, it's fine to use the same instance pool for both worker and driver nodes. In fact, if you don't specify a separate pool for the driver, the workers' instance pool (if configured) is used for the driver too. This is described in the clusters documentation, in the description of the instance_pool_id setting:
The optional ID of the instance pool to use for cluster nodes. If driver_instance_pool_id is present, instance_pool_id is used for worker nodes only. Otherwise, it is used for both the driver and worker nodes.
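For illustration, a hedged sketch of a Jobs API call where both IDs point at the same pool; the workspace URL, token, pool ID, notebook path and runtime version are placeholders:
curl -X POST https://<databricks-instance>/api/2.1/jobs/create \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "pool-demo-job",
    "tasks": [{
      "task_key": "main",
      "notebook_task": {"notebook_path": "/Shared/demo"},
      "new_cluster": {
        "spark_version": "11.3.x-scala2.12",
        "num_workers": 2,
        "instance_pool_id": "pool-0123456789abcdef",
        "driver_instance_pool_id": "pool-0123456789abcdef"
      }
    }]
  }'
Omitting driver_instance_pool_id here has the same effect, since instance_pool_id then covers the driver as well.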
Cluster terminated. Reason: Cloud provider launch failure
A cloud provider error was encountered while launching worker nodes. See the Databricks guide for more information.
GCP error message: Compute Quota Exceeded for databricksharish2022 in region asia-southeast1: Quota: SSD_TOTAL_GB, used 0.0 and requested 1000.0 out of 500.0
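The request for 1000 GB of SSD exceeds the 500 GB regional limit. A hedged way to inspect that quota, assuming the gcloud CLI is configured for the affected project:
gcloud compute regions describe asia-southeast1 | grep -B 1 -A 1 SSD_TOTAL_GB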
I have a Kubernetes cluster running on an Azure Virtual Machine Scale Set. I use the Kubernetes Cluster Autoscaler to scale the number of nodes. It works fine if I set the limits from 1 to 10, but a problem appears when I set the lower limit to 0, in one particular case:
When the number of nodes has been scaled to 0 and the cluster autoscaler pod is restarted afterwards, I then want to run a pod on this VMSS (a pod with nodeSelector agentpool: memory), but it looks like the autoscaler can't read the appropriate labels from the VMSS when the number of instances has been scaled to 0.
According to the documentation, I added the following tag to the VMSS: k8s.io_cluster-autoscaler_node-template_label_agentpool: memory.
These are the logs from the autoscaler pod:
GeneralPredicates predicate mismatch, reason: node(s) didn't match node selector
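For reference, a hedged sketch of applying such a tag to the scale set with the Azure CLI; the node resource group and VMSS names below are placeholders:
az vmss update \
  --resource-group MC_myResourceGroup_myAKSCluster_westeurope \
  --name aks-memory-12345678-vmss \
  --set tags.k8s.io_cluster-autoscaler_node-template_label_agentpool=memory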
I have an Oozie workflow that I'd like to run on an HDInsight cluster. My job has a jar file as well as a workflow.xml file, both stored on Azure Blob Storage. However, the only way I have found to store the job.config file is on the local storage of the HDInsight headnode. My concern is: what happens when the VM gets re-imaged? Does that remove my job.config file?
In general, you can use Script Actions on HDInsight. Script actions perform customization on HDInsight clusters during provisioning, so every time the cluster is created, the scripts are run. (You were smart to be concerned about what happens when the cluster is re-created!)
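On current Linux-based HDInsight clusters a script action is simply a Bash script; a hedged sketch of one that restores job.config onto the headnode after (re-)provisioning, with the storage account, container and local path as placeholders:
#!/bin/bash
# Copy job.config from blob storage back onto the headnode's local disk.
mkdir -p /home/sshuser/oozie
hdfs dfs -copyToLocal \
  wasbs://mycontainer@mystorageaccount.blob.core.windows.net/oozie/job.config \
  /home/sshuser/oozie/job.config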
The advanced configuration options show how to customize an HDInsight cluster during the provisioning process using PowerShell. There is an Oozie section:
# oozie-site.xml configuration
$OozieConfigValues = new-object 'Microsoft.WindowsAzure.Management.HDInsight.Cmdlet.DataObjects.AzureHDInsightOozieConfiguration'
$OozieConfigValues.Configuration = @{ "oozie.service.coord.normal.default.timeout"="150" } # default 120
Does that help?
Other resources:
Customizing HDInsight Cluster provisioning
Oozie tutorial on HDInsight