Multi-regional Azure Container Service DC/OS clusters

I'm experimenting a little with ACS using the DC/OS orchestrator, and while spinning up a cluster within a single region seems simple enough, I'm not quite sure what the best practice would be for doing deployments across multiple regions.
ACS itself does not seem to support deploying a single cluster across more than one region right now. With that assumption, I guess my only other option is to create multiple identical clusters in all the regions I wish to be available in, and then use Azure Traffic Manager to route incoming traffic to the nearest available cluster.
While this solution works, it also causes a few issues I'm not 100% sure on how I should work around.
Our deployment pipelines must make sure to deploy to all regions when deploying a new version of a service. If we have an East US and a North Europe region, during deployments our CI tool has to connect to the Marathon API in both regions to trigger the new deployments. If the deployment fails in one region and succeeds in the other, I suddenly have a disparity between the two regions.
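To illustrate, here is a minimal sketch of what that deployment step might look like against Marathon's /v2/apps API; the cluster URLs and app definition are placeholders, not our real setup:

    # Hypothetical sketch: push the same app definition to Marathon in both
    # regions and fail the CI job on any partial failure.
    import requests

    MARATHON_ENDPOINTS = {
        "eastus": "https://dcos-eastus.example.com/service/marathon",
        "northeurope": "https://dcos-northeurope.example.com/service/marathon",
    }

    app_definition = {
        "id": "/my-service",
        "container": {
            "type": "DOCKER",
            "docker": {"image": "myregistry/my-service:1.2.3"},
        },
        "instances": 3,
        "cpus": 0.5,
        "mem": 512,
    }

    failed = []
    for region, base_url in MARATHON_ENDPOINTS.items():
        # PUT /v2/apps/{app_id} creates or updates the app and triggers a deployment.
        resp = requests.put(
            f"{base_url}/v2/apps{app_definition['id']}",
            json=app_definition,
            timeout=30,
        )
        if resp.ok:
            print(f"{region}: deployment {resp.json().get('deploymentId')} started")
        else:
            failed.append(region)
            print(f"{region}: failed with {resp.status_code}: {resp.text}")

    if failed:
        # Fail the CI job so the regions don't silently drift apart; rolling
        # back the regions that succeeded would be the alternative.
        raise SystemExit(f"Deployment failed in: {', '.join(failed)}")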
If I have a service using local persistent volumes deployed, let's say PostgreSQL or Elasticsearch, it needs to have instances in both regions, since service discovery will only find services local to the region. That brings up the problem of replication between regions to keep all state in all regions; this seems to require a fair amount of manual configuration to get working.
Has anyone ever used a setup somewhat like this using Azure Container Service (or really Amazon Container Service, as I assume the same challenges can be found there) and has some pointers on how to approach this?

You have multiple options for spinning up across regions. I would use a custom installation together with Terraform for each of them. This is a great starting point: https://github.com/bernadinm/terraform-dcos
Distributing agents across different regions should be no problem, ensuring that your services will keep running despite failures.
Distributing masters (giving you control over the services during failures) is a little more difficult, as it involves distributing a ZooKeeper quorum across high-latency links, so you should be careful in choosing the "distance" between regions.
Have a look at the documentation for more details.

You are correct: ACS does not currently support multi-region deployments.
Your first issue is specific to Marathon in DC/OS; I'll ping some of the engineering folks over there to see if they have any input on best practice.
Your second point is something we (I'm the ACS PM) are looking at. There are some solutions you can use in certain scenarios (e.g. ArangoDB is in the DC/OS universe and will provide replication). The DC/OS team may have something to say here too. In ACS we are evaluating the best approaches to providing solutions for this use case but I'm afraid I can't give any indication of timeline.
An alternative solution is to have your database in a SaaS offering. This takes away all the complexity of managing redundancy and replication.

Related

AKS randomly change deployments and pods

I am investigating a robust way to scan my Azure AKS clusters and randomly change the number of pods, allocated resources, and throttling, and if possible limit connections to other resources (e.g. database, queues, cache).
The idea is to have this running against any environment (test, QA, live)
Log what changes were made and when
Email that the script has run
Return environment to desired state
My questions are:
Is there tooling for this already?
Is this possible via cron / Azure Pipelines?
This is part of my stress-testing development cycle, which includes API integration and load testing to help find weaknesses and feed back ways we can improve our offering and our team's reputation.
Google "Kubernetes chaos engineering".
Look at Azure Chaos Studio https://azure.microsoft.com/en-us/products/chaos-studio/#overview
Create a chaos experiment that uses a Chaos Mesh fault to kill AKS pods with the Azure portal https://learn.microsoft.com/en-us/azure/chaos-studio/chaos-studio-tutorial-aks-portal
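As a minimal sketch of the same idea driven from code rather than the portal, you can create a Chaos Mesh PodChaos fault with the official kubernetes Python client, assuming Chaos Mesh is already installed in the AKS cluster; the namespace and labels below are examples:

    # Create a PodChaos fault that kills one random pod matching a label
    # selector. Assumes Chaos Mesh is installed and kubeconfig is available.
    from kubernetes import client, config

    config.load_kube_config()  # use load_incluster_config() when run in-cluster

    pod_chaos = {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "PodChaos",
        "metadata": {"name": "kill-one-api-pod", "namespace": "qa"},
        "spec": {
            "action": "pod-kill",
            "mode": "one",  # pick a single random pod from the matches
            "selector": {
                "namespaces": ["qa"],
                "labelSelectors": {"app": "my-api"},
            },
        },
    }

    # PodChaos is a custom resource, so it goes through the CustomObjectsApi.
    client.CustomObjectsApi().create_namespaced_custom_object(
        group="chaos-mesh.org",
        version="v1alpha1",
        namespace="qa",
        plural="podchaos",
        body=pod_chaos,
    )

A script like this can then be run on a schedule from an Azure Pipelines cron trigger, with the logging and notification steps the question asks about wrapped around it.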

Can I share a k8s cluster securely between many DevOps product teams?

Would there be a secure way for different product DevOps teams to work on the same k8s cluster? How can I isolate workloads between the teams? I know there are k8s RBAC and namespaces available, but is that secure enough to run different prod workloads? I know Istio, but as I understand it there is no direct answer to my question there. How can we handle ingress configuration from different teams in the same cluster? And if it's not possible to isolate workloads securely, how do you orchestrate multiple k8s clusters to reduce maintenance?
Thanks a lot!
The answer is: it depends. First, Kubernetes is not insecure by default and containers give a base layer of abstraction. The better questions are:
How much isolation do you need?
What about user management?
Do you need to encrypt traffic between your workloads?
Isolation Levels
If you need strong isolation between your workloads (and I mean really strong), do yourself a favor and use different clusters. There may be some business cases where you need a guarantee that some kind of workload is not allowed to run on the same (virtual) machine as another. You could also try to do this by adding nodes that are dedicated to one of your sub-projects and using affinities and anti-affinities to handle the scheduling. But if you need this level of isolation, you'll probably run into problems when thinking about log aggregation, metrics, or in general any point where you have a component that's used across all of your services.
For any other use case: build one cluster and divide it by namespaces. You could even create a couple of ingress controllers that each belong to just one of your teams.
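As a minimal sketch of that division, assuming cluster-admin credentials, you could create a namespace per team and bind the team's group to the built-in "edit" ClusterRole scoped to that namespace only; "team-a" is a placeholder for both the namespace and the SSO/OIDC group name:

    from kubernetes import client, config

    config.load_kube_config()
    team = "team-a"

    # One namespace per team.
    client.CoreV1Api().create_namespace(
        {"apiVersion": "v1", "kind": "Namespace", "metadata": {"name": team}}
    )

    # Bind the team's group to the built-in "edit" ClusterRole, scoped to the
    # team's namespace only: they can manage their own workloads but cannot
    # see or touch other teams' namespaces.
    client.RbacAuthorizationV1Api().create_namespaced_role_binding(
        namespace=team,
        body={
            "apiVersion": "rbac.authorization.k8s.io/v1",
            "kind": "RoleBinding",
            "metadata": {"name": f"{team}-edit", "namespace": team},
            "roleRef": {
                "apiGroup": "rbac.authorization.k8s.io",
                "kind": "ClusterRole",
                "name": "edit",
            },
            "subjects": [
                {
                    "apiGroup": "rbac.authorization.k8s.io",
                    "kind": "Group",
                    "name": team,
                }
            ],
        },
    )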
User Management
Managing RBAC and users by hand can be a little bit tricky. Kubernetes itself supports OIDC tokens, so if you already use OIDC for SSO or similar, you could re-use your tokens to authenticate users in Kubernetes. I've never used this myself, so I can't speak to role mapping via OIDC.
Another solution would be Rancher or another cluster-orchestration tool. I can't speak to the others, but Rancher comes with built-in user management. You can also create projects to group several namespaces for one of your audiences.
Traffic Encryption
By using a service mesh like Istio or Linkerd you can encrypt traffic between your pods. Even if it sounds tempting to encrypt your workloads, be clear on whether you really need it. Service meshes come with some downsides, e.g. resource usage. You also have one more component that needs to be managed and updated.

What are the options to host Orleans on Azure without using the Cloud Services?

I want to host an Orleans project on Azure, but don't want to use the (classic) Cloud Services model (I want an ARM template project). The web app sample uses the old web/worker model, so what is the best option? There is a Service Fabric sample; is that the best route? The nearest equivalent to the web/worker model is VM Scale Sets; is that a well-tested option?
IMO, App Service is closest to the web role. For the worker role, however, it depends on your point of view.
From a system-architecture point of view, I think Scale Sets are the closest: you get an identical set of VMs running your application. However, you lose all the management features. How your cluster handles application configuration, workloads on each node, and service interruptions from server failures or deployments is pretty much DIY. You also need to provision the VMs with your application's dependencies.
From an operations point of view, I think Service Fabric is the closest. It handles the problems above, but then you are dealing with design/implementation changes and a learning curve from the added fabric layer in the architecture. That could be small or big depending on the complexity of your project. Besides, Service Fabric is still relatively new and nothing is for sure. Best case, you follow the sample, change a few lines of code, and it works like a charm. Worst case, you may need to completely refactor your Orleans solution into a Service Fabric solution.
App Service would be the easiest among the three. If it doesn't meet your requirements, I personally would try Service Fabric, for the same reason people are moving to the cloud and you opted for an ARM solution.

Turning off ServiceFabric clusters overnight

We are working on an application that processes Excel files and spits out output. Availability is not a big requirement.
Can we turn the VM scale sets off at night and turn them on again in the morning? Will this kind of setup work with Service Fabric? If so, is there a way to schedule it?
Thank you all for replying. I got a chance to talk to a Microsoft Azure rep and have documented the conversation here for the community's sake.
Response to the initial question
A Service Fabric cluster must maintain a minimum number of Primary node type instances in order for the system services to maintain a quorum and ensure the health of the cluster. You can see more about the reliability level and instance count at https://azure.microsoft.com/en-gb/documentation/articles/service-fabric-cluster-capacity/. As such, stopping all of the VMs will cause the Service Fabric cluster to go into quorum loss. Frequently it is possible to bring the nodes back up and Service Fabric will automatically recover from this quorum loss; however, this is not guaranteed and the cluster may never be able to recover.
However, if you do not need to save state in your cluster then it may be easier to just delete and recreate the entire cluster (the entire Azure resource group) every day. Creating a new cluster from scratch by deploying a new resource group generally takes less than half an hour, and this can be automated by using PowerShell to deploy an ARM template. https://azure.microsoft.com/en-us/documentation/articles/service-fabric-cluster-creation-via-arm/ shows how to set up the ARM template and deploy using PowerShell. You can additionally use a fixed domain name or static IP address so that clients don't have to be reconfigured to connect to the cluster. If you need to maintain other resources such as the storage account, you could also configure the ARM template to only delete the VM Scale Set and the SF Cluster resource while keeping the network, load balancer, storage accounts, etc.
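A rough sketch of that nightly delete / morning recreate flow using the Azure Python SDK rather than PowerShell; the subscription ID, resource group name, location, and template path below are placeholders:

    import json
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.resource import ResourceManagementClient

    client = ResourceManagementClient(DefaultAzureCredential(), "<subscription-id>")
    RG = "sf-cluster-rg"

    def teardown():
        # Evening: delete the whole resource group (cluster, VMs and all).
        client.resource_groups.begin_delete(RG).result()

    def recreate():
        # Morning: recreate the group and redeploy the cluster's ARM template.
        client.resource_groups.create_or_update(RG, {"location": "westeurope"})
        with open("sfcluster.json") as f:
            template = json.load(f)
        client.deployments.begin_create_or_update(
            RG,
            "nightly-recreate",
            {
                "properties": {
                    "mode": "Incremental",
                    "template": template,
                    "parameters": {},  # cluster-specific parameters go here
                }
            },
        ).result()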
Q) Is there a better way to stop/start the VMs than directly from the scale set?
If you want to stop the VMs in order to save cost, then starting/stopping the VMs directly from the scale set is the only option.
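A minimal sketch of that with the Azure Python SDK (azure-mgmt-compute); the resource names are placeholders, and the quorum-loss caveat above still applies:

    # Deallocating (rather than just powering off) stops compute billing.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.compute import ComputeManagementClient

    compute = ComputeManagementClient(DefaultAzureCredential(), "<subscription-id>")
    RG, VMSS = "sf-cluster-rg", "sf-nodetype-0"

    # Evening: stop all VMs in the scale set.
    compute.virtual_machine_scale_sets.begin_deallocate(RG, VMSS).result()

    # Morning: start them again; cluster recovery is not guaranteed.
    compute.virtual_machine_scale_sets.begin_start(RG, VMSS).result()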
Q) Can we make a Primary node type with the cheapest VMs we can find and add a secondary node type with powerful VMs that we can turn on and off?
Yes, it is definitely possible to create two node types – a Primary that is small/cheap, and a 'Worker' that is a larger size – and set placement constraints on your application to only deploy to those larger-size VMs. However, if your Service Fabric service is storing state then you will still run into a similar problem: once you lose quorum (below 3 replicas/nodes) on your Worker VMs, there is no guarantee that your SF service will come back with all of its state maintained. In this case your cluster itself would still be fine, since the Primary nodes are running, but your service's state may be in an unknown replication state.
I think you have a few options:
Instead of storing state within Service Fabric's reliable collections, store your state externally in something like Azure Storage or SQL Azure. You can optionally use something like Redis Cache or Service Fabric's reliable collections to maintain a faster read cache; just make sure all writes are persisted to the external store. This way you can freely delete and recreate your cluster at any time you want.
Use the Service Fabric backup/restore in order to maintain your state, and delete the entire resource group or cluster overnight and then recreate it and restore state in the morning. The backup/restore duration will depend entirely on how much data you are storing and where you export the backup.
Utilize something such as Azure Batch. Service Fabric is not really designed to be a temporary high capacity compute platform that can be started and stopped regularly, so if this is your goal you may want to look at an HPC platform such as Azure Batch which offers native capabilities to quickly burst up compute capacity.
No. You would have to delete the cluster, then recreate it and redeploy the application in the morning.
Turning off the cluster is, as Todd said, not an option. However, you can scale down the number of VMs in the cluster.
During the day you would run the number of VMs required. At night you can scale down to the minimum of 5. Check this page on how to scale VM scale sets: https://azure.microsoft.com/en-us/documentation/articles/service-fabric-cluster-scale-up-down/
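A rough sketch of that scheduled scale-down/scale-up by patching the scale set's SKU capacity with the Azure Python SDK; resource names are placeholders, and your cluster's durability tier determines the safe minimum:

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.compute import ComputeManagementClient

    compute = ComputeManagementClient(DefaultAzureCredential(), "<subscription-id>")
    RG, VMSS = "sf-cluster-rg", "sf-nodetype-0"

    def set_capacity(count: int):
        # Patch the scale set's SKU capacity; Service Fabric adds or removes
        # nodes as the VM instances come and go.
        vmss = compute.virtual_machine_scale_sets.get(RG, VMSS)
        vmss.sku.capacity = count
        compute.virtual_machine_scale_sets.begin_create_or_update(
            RG, VMSS, vmss
        ).result()

    set_capacity(5)   # night: scale down to the minimum of 5
    set_capacity(10)  # morning: scale back up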
For development purposes, you can create a Dev/Test Lab Service Fabric cluster which you can start and stop at will.
I have also been able to start and stop SF clusters on Azure by starting and stopping the VM scale sets associated with those clusters. But upon restart all your applications (and with them their state) are gone and must be redeployed.

GeoIP Routing with Windows Azure

I'm working on a somewhat large project that will eventually be loaded on Azure. The idea is we will have multiple compute nodes all over the world as our customer base is potentially that large. The question I have is this:
If I have nodes in the US, Europe, Asia, etc. for DR and load-balancing reasons, how can I combine the idea of geo-based DNS results with Azure, since our application will simply be a CNAME for our URL?
I'm not sure I quite understand the deployment strategy for one application running out of multiple regions with Azure. Does anyone have any links or references to better understand the model?
Mod Note: Not sure if this should be ServerFault but I thought StackOverflow was a better location.
Thanks,
Brent
Look at the Windows Azure Traffic Manager. It allows you to group deployments across regions as one logical service and automatically routes each request to the nearest region.
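As a minimal sketch, a Traffic Manager profile with "Performance" routing (which sends each client to the region with the lowest network latency) can be created with the Azure Python SDK (azure-mgmt-trafficmanager); the names and endpoint targets below are placeholders:

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.trafficmanager import TrafficManagerManagementClient

    tm = TrafficManagerManagementClient(DefaultAzureCredential(), "<subscription-id>")

    tm.profiles.create_or_update(
        "my-rg",
        "myapp-tm",
        {
            "location": "global",
            "traffic_routing_method": "Performance",  # route to nearest region
            "dns_config": {"relative_name": "myapp", "ttl": 30},
            "monitor_config": {"protocol": "HTTP", "port": 80, "path": "/"},
            "endpoints": [
                {
                    "name": "us",
                    "type": "Microsoft.Network/trafficManagerProfiles/externalEndpoints",
                    "target": "myapp-us.cloudapp.net",
                    "endpoint_location": "East US",
                },
                {
                    "name": "europe",
                    "type": "Microsoft.Network/trafficManagerProfiles/externalEndpoints",
                    "target": "myapp-eu.cloudapp.net",
                    "endpoint_location": "North Europe",
                },
            ],
        },
    )
    # You then point the CNAME for your own domain at myapp.trafficmanager.net.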
