HA in Databricks clusters in Azure and AWS

The benefits of Spark for handling VM disruptions within a zone are clear, but what happens if an entire zone goes down? Do the VMs in a running cluster fail over to a second zone in the same region, or is this something that has to be done manually?
br,
Mike

Databricks can be deployed in the following modes:
Active deployment: Users can connect to an active deployment of a Databricks workspace and run workloads. Jobs are scheduled periodically using the Databricks scheduler or another mechanism. Data streams can be executed on this deployment as well. Some documents might refer to an active deployment as a hot deployment.
Passive deployment: Processes do not run on a passive deployment. IT teams can set up automated procedures to deploy code, configuration, and other Databricks objects to the passive deployment. A deployment becomes active only if the current active deployment is down. Some documents might refer to a passive deployment as a cold deployment.
For more details, you can refer to the linked documentation.
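As a rough illustration (not taken from the Databricks documentation), a minimal health-check sketch against the active workspace might look like the following. The workspace URLs, tokens, and the is_healthy helper are placeholders/assumptions; actually promoting the passive deployment (switching traffic, starting jobs and streams) is left to your own automation.

    # Minimal sketch: poll the active Databricks workspace and flag a failover
    # to the passive one if it stops responding. Hostnames and tokens are
    # placeholders; promoting the passive deployment is your own automation.
    import requests

    ACTIVE = {"host": "https://adb-1111111111111111.1.azuredatabricks.net", "token": "<active-pat>"}
    PASSIVE = {"host": "https://adb-2222222222222222.2.azuredatabricks.net", "token": "<passive-pat>"}

    def is_healthy(ws: dict) -> bool:
        """Treat a successful clusters/list call as a liveness signal for the workspace."""
        try:
            r = requests.get(
                f"{ws['host']}/api/2.0/clusters/list",
                headers={"Authorization": f"Bearer {ws['token']}"},
                timeout=10,
            )
            return r.status_code == 200
        except requests.RequestException:
            return False

    if not is_healthy(ACTIVE):
        print("Active deployment unreachable; promote the passive deployment here.")
        # e.g. start jobs/streams on PASSIVE and repoint client configuration to it.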

Related

AKS randomly change deployments and pods

I am investigating a robust way to scan my Azure AKS clusters and randomly change the number of pods, the allocated resources, and throttling, and if possible limit connections to other resources (e.g. databases, queues, caches).
The idea is to have this running against any environment (test, QA, live)
Log what changes were made and when.
Email a notification that the script has run.
Return the environment to its desired state.
My questions are:
Is there tooling for this already?
Is this possible via cron / Azure Pipelines?
This is part of my stress-testing development cycle, which includes API integration and load testing, to help find weaknesses and feed back ways we can improve our offering and our team's reputation.
Google "Kubernetes chaos engineering".
Look at Azure Chaos Studio https://azure.microsoft.com/en-us/products/chaos-studio/#overview
Create a chaos experiment that uses a Chaos Mesh fault to kill AKS pods with the Azure portal https://learn.microsoft.com/en-us/azure/chaos-studio/chaos-studio-tutorial-aks-portal
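If you end up rolling something small yourself before adopting Chaos Studio or Chaos Mesh, a minimal sketch of the "randomly kill pods" part with the official Kubernetes Python client could look like this. The namespace and kubeconfig handling are assumptions; returning to the desired state is handled by the Deployment controller, which recreates the deleted pod.

    # Minimal chaos sketch: delete one randomly chosen pod in a namespace.
    # Assumes kubeconfig access to the AKS cluster; the Deployment controller
    # recreates the pod, returning the environment to its desired state.
    import logging
    import random
    from kubernetes import client, config

    logging.basicConfig(level=logging.INFO)

    def kill_random_pod(namespace: str = "default") -> None:
        config.load_kube_config()          # or load_incluster_config() when run inside AKS
        core = client.CoreV1Api()
        pods = core.list_namespaced_pod(namespace).items
        if not pods:
            logging.info("No pods found in namespace %s", namespace)
            return
        victim = random.choice(pods)
        logging.info("Deleting pod %s", victim.metadata.name)   # log what changed and when
        core.delete_namespaced_pod(victim.metadata.name, namespace)

    if __name__ == "__main__":
        kill_random_pod("default")

A script like this can be run on a schedule from cron or an Azure Pipelines scheduled trigger, with the email/reporting step wrapped around it.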

Trigger azure kubernetes cronjob by azure function

As per our requirements, we have a Docker container batch job running on Azure Kubernetes Service that needs to be triggered when files arrive at an Azure Blob Storage location. Please advise how this is possible or doable.
Alternatively, is there another design option, for example using an Azure Function to start the batch job on an AKS pod when a file arrives in the Blob Storage location? Is this approach possible?
Executing the workload in functions is not an option?
You could use Event Grid to collect the blob events, or send a message to Azure Service Bus / a Storage Queue, and listen to those events/messages from a pod. However, this would require the pod to be running all the time.
If you want true "event-based processing" within Kubernetes, your best option is KEDA.
If your Kubernetes management plane is exposed to the internet or reachable from a VNet (Azure Functions can connect to a VNet, though that requires the Premium plan), you could execute kubectl commands from a function. But I would recommend using one of the solutions above.
Keep in mind that the Azure Functions blob trigger is not always 100% reliable; use messaging or Event Grid if possible.
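As a rough illustration of the kubectl-style option, here is a sketch of a blob-triggered Azure Function (Python) that creates a one-off Kubernetes Job from an existing CronJob's template, i.e. the Python equivalent of "kubectl create job --from=cronjob/...". The namespace, CronJob name, and kubeconfig handling are assumptions; the function needs network access to the AKS API server and suitable credentials.

    # Minimal sketch, not production code: blob trigger fires on file arrival,
    # then a Job is created from the batch CronJob's jobTemplate.
    import uuid
    import azure.functions as func
    from kubernetes import client, config

    NAMESPACE = "batch"          # assumption: namespace where the CronJob lives
    CRONJOB_NAME = "batch-job"   # assumption: name of the existing CronJob

    def main(myblob: func.InputStream) -> None:
        config.load_kube_config()    # assumption: kubeconfig packaged/injected with the function
        batch = client.BatchV1Api()
        cronjob = batch.read_namespaced_cron_job(CRONJOB_NAME, NAMESPACE)
        job = client.V1Job(
            api_version="batch/v1",
            kind="Job",
            metadata=client.V1ObjectMeta(name=f"{CRONJOB_NAME}-{uuid.uuid4().hex[:8]}"),
            spec=cronjob.spec.job_template.spec,
        )
        batch.create_namespaced_job(NAMESPACE, job)

A KEDA ScaledJob with a blob or queue scaler achieves a similar result from inside the cluster, without exposing the API server to the function.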

Reading from AKS Master node

From whatever I have read, I could not find a way to connect to the master node in Azure Kubernetes Service. I have a requirement to read some parameters such as 'enable-admission-plugins', which is possible from the master node. Is there any third-party API available to get this information?
More specifically, I need to read the files 'kube-apiserver.yaml' and 'kube-controller-manager.yaml'.
No, this is not possible. The masters are managed by Microsoft and you don't have access to them. All configuration is done through the AKS API (mostly when you create the cluster).
Azure Kubernetes Service (AKS) makes it simple to deploy a managed Kubernetes cluster in Azure. AKS reduces the complexity and operational overhead of managing Kubernetes by offloading much of that responsibility to Azure. As a hosted Kubernetes service, Azure handles critical tasks like health monitoring and maintenance for you. The Kubernetes masters are managed by Azure. You only manage and maintain the agent nodes.
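So the closest you can get is inspecting what the AKS resource itself exposes through the ARM API. A minimal sketch with the Azure Python SDK follows; the subscription ID, resource group, and cluster name are placeholders.

    # Minimal sketch: read the managed cluster's configuration through the
    # AKS (ARM) API instead of the master node. Requires azure-identity and
    # azure-mgmt-containerservice; the names below are placeholders.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.containerservice import ContainerServiceClient

    SUBSCRIPTION_ID = "<subscription-id>"
    RESOURCE_GROUP = "<resource-group>"
    CLUSTER_NAME = "<aks-cluster>"

    aks = ContainerServiceClient(DefaultAzureCredential(), SUBSCRIPTION_ID)
    cluster = aks.managed_clusters.get(RESOURCE_GROUP, CLUSTER_NAME)

    # Surfaces the cluster-level settings AKS exposes (version, add-ons,
    # network profile, etc.); the kube-apiserver flags themselves stay hidden.
    print(cluster.kubernetes_version)
    print(cluster.network_profile)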

Service Fabric cluster takes forever to deploy

Is there a way to shorten the process? Should I have two Service Fabric clusters if we want to implement a continuous delivery process?
If the Service Fabric cluster deployment (i.e. the creation of a Service Fabric cluster) is stuck, open a support issue in the Azure portal to help get it resolved.
For application deployment you don't need a separate cluster to do CD. Depending on your CD strategy (e.g. rolling upgrades, rip and replace, blue/green), there are various ways of doing that in Service Fabric. Take a look at the conceptual documentation on this topic: https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-application-upgrade

Allocating temporary VMs to parallelize one-shot batch jobs (GCP, Azure or AWS)

I am evaluating options for launching arbitrary Python tasks/scripts on temporary cloud VMs that get shut down once the job is complete. I am looking across all cloud providers, but the ideal solution should not be vendor-specific. Here is what I found:
Docker Swarm / Kubernetes / Nomad for spinning up Docker containers. All look attractive, but I cannot confirm whether they can terminate VMs once the task is complete.
Cloud Functions / Lambdas look great, but they only work for short-lived (a few minutes) tasks. Also, GCP supports only JavaScript.
Spinning VMs up/down explicitly from a launch script with vendor-specific commands. Straightforward and should work (see the sketch after this question).
AWS Batch, Azure Batch - vendor-specific services for batch jobs
AWS Data Pipeline, Azure Data Factory, Google Dataflow - vendor-specific services for data pipelines
Did I miss any good option? Does any container orchestration service like Docker Swarm support allocation and deallocation of multiple temporary VMs to run a one-shot job?
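For the third option, a minimal sketch of the explicit spin-up/spin-down approach with boto3: the AMI ID, instance type, and script location are placeholders, the user data runs the task and then shuts the instance down, and the shutdown behaviour makes that shutdown terminate the VM.

    # Minimal sketch: launch a temporary EC2 VM, run one Python task via
    # user data, and let the instance terminate itself when the task ends.
    import boto3

    USER_DATA = (
        "#!/bin/bash\n"
        "aws s3 cp s3://my-bucket/task.py /tmp/task.py\n"   # placeholder script location
        "python3 /tmp/task.py\n"
        "shutdown -h now\n"                                 # terminates the VM (see below)
    )

    ec2 = boto3.client("ec2", region_name="us-east-1")
    ec2.run_instances(
        ImageId="ami-0123456789abcdef0",                    # placeholder AMI
        InstanceType="t3.medium",
        MinCount=1,
        MaxCount=1,
        UserData=USER_DATA,
        # 'terminate' makes an OS-level shutdown delete the VM instead of stopping it.
        InstanceInitiatedShutdownBehavior="terminate",
    )

The same pattern works on Azure (az vm create plus az vm delete) and GCP (gcloud compute instances create with a startup script), just with vendor-specific commands.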
