Reduce costs of Azure availability set - azure

I am planning on running SharePoint Foundation on one VM of size A3 and SQL Server on another of size A6. As far as I understand, this is not enough to qualify for the SLA, and I should use 2 more instances, one for SharePoint and one for SQL Server, configured in 2 separate availability sets.
Can I use scaling (by CPU usage) to turn off one instance and leave only one running at a time in an availability set? This would reduce the costs, but I wonder if this solution will be good enough to achieve Azure's SLA. The way I see it, one instance is running at a time while the other one is shut down, so I am billed for one instance. When there is an update or failure, the instance that has been running until then is shut down and the other one comes online. Is this the way it works? Can I cut the costs of availability sets like this?

No, the SLA requires two running instances. However, if you want to control your costs, the approach you describe will work. Just keep in mind that the duration of a disruption will depend on how quickly you detect that the primary VM has failed and how fast you can start the secondary VM. And depending on the nature of the service disruption, it may not be possible for you to start the secondary at all. So it's a risk.
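To make the trade-off concrete, here is a minimal sketch of the arithmetic. The 99.95% figure, the 30-day month, and the detection and start times are assumptions used for illustration only; check them against the SLA terms that actually apply to you.

```python
def allowed_downtime_minutes(sla_percent, days=30):
    """Downtime budget, in minutes, implied by an availability percentage."""
    return (1 - sla_percent / 100) * days * 24 * 60


def failover_outage_minutes(detect_minutes, start_minutes):
    """Outage for one manual failover: time to notice plus time to boot the standby."""
    return detect_minutes + start_minutes


budget = allowed_downtime_minutes(99.95)   # assumed availability target
outage = failover_outage_minutes(10, 8)    # hypothetical detection and start times
print(f"monthly budget: {budget:.1f} min, one failover: {outage} min")
```

With a cold standby, a single slow failover can consume most of a month's budget, which is the risk described above.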

Choosing the right EC2 instance for three NodeJS Applications

I'm running three MEAN stack programmes. Each application receives over 10,000 monthly users. Could you please assist me in finding an EC2 instance for my apps?
I've been using a "t3.large" instance with two vCPUs and eight gigabytes of RAM, but it costs $62 to $64 per month.
I need help deciding which EC2 instance to use for three Nodejs applications.
First, check the CloudWatch metrics for the current instance. Are CPU and memory usage consistent over time? Analysing the metrics can help you decide whether you should select a smaller or bigger instance.
One way to avoid unnecessary costs is to use Auto Scaling groups and load balancers. With the proper settings, you can always have the right amount of computing power for your applications.
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/working_with_metrics.html
https://docs.aws.amazon.com/autoscaling/ec2/userguide/auto-scaling-groups.html
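As a rough illustration of the metric check described in this answer, the sketch below classifies a series of CPU utilisation samples. The function name and the 20% and 80% thresholds are hypothetical; the samples would come from whatever you export from CloudWatch, and memory should be checked the same way.

```python
from statistics import mean


def suggest_resize(cpu_percent_samples, low=20.0, high=80.0):
    """Return 'smaller', 'bigger' or 'keep' from CPU utilisation samples (0-100)."""
    if not cpu_percent_samples:
        raise ValueError("need at least one sample")
    avg = mean(cpu_percent_samples)
    peak = max(cpu_percent_samples)
    if peak < low:      # never busy, even at peak: likely oversized
        return "smaller"
    if avg > high:      # busy on average: likely undersized
        return "bigger"
    return "keep"


print(suggest_resize([12, 9, 15, 11]))   # hypothetical samples
```

Usage that is low on average but spiky falls into "keep" here, which is the case where auto scaling tends to help more than changing the instance size.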
It depends on your applications. Do your apps need more compute power, more memory, or more storage? Choosing a server is similar to installing an app on a system: check what its basic requirements are, then choose the server accordingly.
If you have 10k+ monthly customers, think about using an ALB so that traffic gets distributed evenly. Try caching to serve some content if possible. Use the unlimited burst mode of T3 instances if CPU keeps hitting 100%. Also, try to optimize the code so that fewer resources are consumed. Once you are comfortable with your EC2 choice, consider purchasing Savings Plans or RIs to lower the cost.
Also, monitor the servers and traffic using the CloudWatch agent, Internet Monitor, and similar features.

Horizontal/Vertical scaling of self hosted integration runtime

We're looking for an automated way to horizontally and vertically scale the pool of self-hosted integration runtime virtual machines used in ADF.
The Microsoft docs do not provide an answer.
Well, I don't have the experience, so I can only give you a theoretical answer, but maybe it's helpful for you.
AFAIK, neither way is configurable out of the box. For scale-out you'll have to deploy an additional IR machine yourself. So you'll probably want to create an image that you can provision from Docker or Kubernetes and that has the IR and its prerequisites installed. The IR installation provides a PowerShell script that can be used to create an automated connection.
For scale-up/down, you'll have to run some script that scales your VM. In an IaaS solution (e.g. an Azure VM), that should be doable with an API call to change your VM.
For both cases you'll need some kind of monitor in place that watches the IR loads and makes changes as needed. I think the metrics provided in Data Factory should do. Maybe you can use Log Analytics to monitor the loads.
I'm curious about your use case for this.
My solution is just for scaling out/in since the VM must be restarted if you are scaling up/down, which causes downtime and job failures etc.
At a high level this solution requires just 3 simple things:
Azure Metric Alert that fires when Scale-Out should occur (VM Start)
Azure Metric Alert that fires when Scale-In should occur (VM Deallocation)
Logic App that is triggered by the Azure Alert and actually executes the Start/Stop of the VM, along with any other automation associated with this (e.g. posting to a Teams channel when Scale in/out occurs)
Here are more details on how we set up the conditions for the alerts. The main metrics to keep in mind are IR CPU %, IR queue length, number of nodes, and possibly IR memory.
Scale-Out
Scale-In
Actions for Alerts
Both alerts trigger a single Logic App. Using the payload that is passed to the Logic App, you can determine whether it should start or stop the VM (as well as perform any additional actions).
Logic App
There is a small chance that, due to timing (and depending on how many ADFs the IR is shared with), pipeline activities could be sent to Node 2 at the same time a deallocation command is sent to the VM for Node 2. I have not seen this yet, but adjusting the alert conditions based on your needs could help avoid it. Feel free to play around with the alert conditions, granularity, thresholds, etc. This is not a one-size-fits-all solution.
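The start/stop decision made from the alert payload can be sketched as below. This is only an illustration in Python rather than a Logic App definition. The rule names are hypothetical, and the payload field names (`data.essentials.alertRule`, `monitorCondition`) are modeled on the Azure Monitor common alert schema, so treat them as assumptions and check them against the payload your alerts actually send.

```python
# Hypothetical alert rule names; use whatever your two metric alerts are called.
SCALE_OUT_RULE = "ir-scale-out"
SCALE_IN_RULE = "ir-scale-in"


def decide_action(payload):
    """Map an alert payload to 'start', 'deallocate' or 'none'."""
    essentials = payload.get("data", {}).get("essentials", {})
    # Only act when the alert fires, not when it resolves.
    if essentials.get("monitorCondition") != "Fired":
        return "none"
    rule = essentials.get("alertRule", "")
    if rule == SCALE_OUT_RULE:
        return "start"
    if rule == SCALE_IN_RULE:
        return "deallocate"
    return "none"


example = {"data": {"essentials": {"alertRule": "ir-scale-out",
                                   "monitorCondition": "Fired"}}}
print(decide_action(example))
```

Ignoring resolved alerts keeps the scale-out alert clearing from being mistaken for a scale-in request; scale-in stays the job of its own alert.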

How do I determine the number of Node Types, Number of nodes and VM size in Service Fabric cluster for a relatively simple but high throughput API?

I have an ASP.NET Core 2.0 Web API that has relatively simple logic (a simple select on a SQL Azure DB that returns about 1000-2000 records; no joins, aggregates, functions, etc.). I have only 1 GET API, which is called from an Angular SPA. Both are deployed in Service Fabric as stateless services, hosted in Kestrel as self-hosting exes.
Considering the number of users and how often they refresh, I've determined there will be around 15000 requests per minute, in other words 250 req/sec.
I'm trying to understand the different settings when creating my Service Fabric cluster.
I want to know:
How many Node Types? (I've determined as Front-End, and Back-End)
How many nodes per node type?
What is the VM size I need to select?
I have read the Azure documentation on cluster capacity planning. While I understand the concepts, I don't have a frame of reference to determine the actual values I need to provide for the above questions.
In most places where you read about the planning of a cluster they will suggest that this subject is part science and part art, because there is no easy answer to this question. It's hard to answer it because it depends a lot on the complexity of your application, without knowing the internals on how it works we can only guess a solution.
Based on your questions, the best guidance I can give you is: measure first, measure again, measure... plan later. Your application might be memory intensive, network intensive, CPU intensive, disk intensive and so on; the only way to find the best configuration is to understand it.
To understand your application before you make any decision on SF structure, you could simply deploy a simple cluster with multiple node types containing one node of each VM size and measure your application behavior on each of them, and then you would add more nodes and span multiple instances of your service on these nodes and see which configuration is a best fit for each service.
1. How many Node Types?
I like to map node types 1:1 to the roles in your application, but this is not a law; it depends on how many resources each service consumes. If a service consumes enough resources (memory, CPU, disk, IO) to keep a single VM (node) busy, it is a good candidate for its own node type. Other services are lightweight, such as scheduled jobs and backups, and provisioning an entire VM (node) just for them would be a waste of resources. In that case you could provision a set of machines shared by these services. One important thing to keep in mind when multiple services share a node type is that they will compete for resources (memory, CPU, network, disk), so the performance measurements you took for each service in isolation might no longer hold and they may require more resources. The way to find out is to test them together.
Another point is the number of replicas. Having a single instance of your service is not reliable, so you would have to create replicas of it (I describe the right number in the next answer). You then end up with the service load split across multiple nodes, which can leave the node type underutilized; that is when you would consider combining services on the same node type.
2. How many nodes per node type?
As stated before, it will depend on your service resource consumption, but a very basic rule is a minimum of 3 per node type.
Why 3?
Because 3 is the lowest number where you can do a rolling update and still guarantee a quorum of 51% of nodes/services/instances running.
1 Node: If you have a service running 1 instance in a node-type of 1 node, when you deploy a new version of your service, you would have to bring down this instance before the new comes up, so you would not have any instance to serve the load while upgrading.
2 Nodes: Similar to 1 node, but in this case you keep only 1 node running during the upgrade. In case of failure, you wouldn't have a failover to handle the load until the new instance comes up. It is worse if you are running a stateful service, because you will have only one copy of your data during the upgrade, and in case of failure you might lose data.
3 Nodes: During an update you still have 2 nodes available. When the one being updated comes back, the next one is taken down and you still have 2 nodes running. In case of failure of one node, the other node can support the load until a new node is deployed.
3 nodes does not mean that your cluster will be highly reliable; it means the chances of failure and data loss will be lower. You might be unlucky and lose 2 nodes at the same time. As suggested in the docs, in production it is better to always keep the number of nodes at 5 or more, and plan to have a quorum of 51% of nodes/services available. I would recommend 5, 7 or 9 nodes in cases where you really need higher uptime (99.9999...%).
3. What is the VM size I need to select?
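The quorum reasoning above can be checked with a few lines of arithmetic. This is a minimal sketch that treats "quorum" as a strict majority of nodes; the function names are illustrative and not part of any Service Fabric API.

```python
def quorum_size(node_count):
    """Smallest number of nodes that is a strict majority (more than 50%)."""
    return node_count // 2 + 1


def tolerable_failures(node_count):
    """How many nodes can be down while a majority is still running."""
    return node_count - quorum_size(node_count)


for n in (1, 2, 3, 5, 7, 9):
    print(n, "nodes -> quorum", quorum_size(n), "-> can lose", tolerable_failures(n))
```

With 1 or 2 nodes no node can be lost, 3 nodes tolerate one node being down (for example during a rolling update), and 5 nodes tolerate two, which is why odd counts of 3 or more keep coming up.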
As said before, only measurements will give this answer.
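Once you have measurements, turning them into a node count is simple arithmetic. The sketch below assumes you have measured how many requests per second one node of a given VM size sustains; the per-node figure and the headroom factor are hypothetical placeholders, while the 250 req/sec target and the minimum of 3 nodes come from the discussion above.

```python
import math


def nodes_needed(target_rps, measured_rps_per_node, headroom=1.5, minimum=3):
    """Nodes required to serve target_rps with spare capacity, never below `minimum`."""
    if measured_rps_per_node <= 0:
        raise ValueError("measured_rps_per_node must be positive")
    by_load = math.ceil(target_rps * headroom / measured_rps_per_node)
    return max(minimum, by_load)


# 250 req/sec from the question; 100 req/sec per node is a made-up measurement.
print(nodes_needed(250, measured_rps_per_node=100))
```

Repeating the calculation for each VM size you measured lets you compare the total cost of a few large nodes against more small ones.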
Observations:
These recommendations do not take into account the planning for primary node types. It is recommended to have at least 5 nodes on primary node types, because that is where the SF system services are placed. They are responsible for managing the cluster, so they must be highly reliable; otherwise you risk losing control of your cluster. If you plan to share these nodes with your application services, keep in mind that your services might impact the system services, so you must always monitor the nodes to check for any impact.

Is auto-scaling at a VM-level (cell-level) possible in Cloud Foundry? If yes, how can this be achieved?

I have seen two levels of scaling instances in open-source Cloud Foundry.
cf scale -i INSTANCES
cf scale -m MEMORY -k DISK
Is there something available for a cell-level auto-scaling in CF? e.g. If I have 5 instances of an app running and I want to launch 15 more but the current no. of cell VMs that are running have a capacity of running only 15 instances in total. Can I use an existing service that recognises that the load to be served would need one more cell to be launched and spawn another machine?
I'm looking to deploy CF on Azure, so Azure-specific solution would also help.
I think the short answer is no (at least at the time of me writing this). Usually, Cloud Foundry is deployed using Bosh and Bosh does not have an auto scaling feature.
The way that a CF platform is typically managed is that as a CF operator, you would have monitoring setup so that you can see the capacity of your platform (there are metrics that tell you how much capacity is left on your Cells) and also alert when your platform hits certain capacity limits. When you reach these, you can then use Bosh to scale up or down the number of Cells accordingly. This would be a manual operation with Bosh though.
Having said that, I suppose there's nothing to stop you from using the alerts to automatically trigger Bosh to scale up or down the Cells, there's just nothing (as of me writing this) to do that out-of-the-box (i.e. part of Bosh itself).
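If you do wire alerts up to such automation, the decision it has to make is roughly the following. This is a simplified sketch: it counts app instances per cell, whereas real cell capacity is driven by memory and disk, and the function name and the even split of instances across cells are assumptions. The result would still have to be applied through your own BOSH deployment process.

```python
import math


def extra_cells_needed(desired_instances, current_cells, instances_per_cell):
    """Number of additional cells required to place desired_instances."""
    if instances_per_cell <= 0:
        raise ValueError("instances_per_cell must be positive")
    required_cells = math.ceil(desired_instances / instances_per_cell)
    return max(0, required_cells - current_cells)


# The question's scenario: 5 running + 15 new = 20 instances, current capacity 15
# (modeled here as 3 cells holding 5 instances each, which is an assumption).
print(extra_cells_needed(20, current_cells=3, instances_per_cell=5))
```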
Hope that helps!

What does the Azure Web Apps architecture look like?

I've had a few outages of 10 to 15 minutes, apparently because Microsoft had a 'blip' on their storage. They told me that it is because of a shared file system between the instances (making it a single point of failure?)
I didn't understand it and asked how file share is involved, because I would assume a really dumb stateless IIS app that communicates with SQL Azure for its data.
I would assume the situation below:
This is their reply to my question (I didn't include the drawing)
The file shares are not necessarily for your web app to communicate to
another resources but they are on our end where the app content
resides on. That is what we meant when we suggested that about storage
being unavailable on our file servers. The reason the restarts would
be triggered for your app that is on both the instances is because the
resources are shared, the underlying storage would be the same for
both the instances. That’s the reason if it goes down on one, the
other would also follow eventually. If you really want the
availability of the app to be improved, you can always use a traffic
manager. However, there is no guarantee that even with traffic manager
in place, the app doesn’t go down but it improves overall availability
of your app. Also we have recently rolled out an update to production
that should take care of restarts caused by storage blips ideally, but
for this feature to be kicked it you need to make sure that there is
ample amount of memory needs to be available in the cases where this
feature needs to kick in. We have couple of options that you can have
set up in order to avoid any unexpected restarts of the app because of
a storage blip on our end:
You can evaluate if you want to move to a bigger instance so that
we might have enough memory for the overlap recycling feature to be
kicked in.
If you don’t want to move to a bigger instance, you can always use
local cache feature as outlined by us in our earlier email.
Because of the time differences the communication takes ages. Can anyone tell me what is wrong in my thinking?
The only thing that I think of is that when you've enabled two instances, they run on the same physical server. But that makes really little sense to me.
I have two instances one core, 1.75 GB memory.
My presumption for App Service Plans was that they were automatically split into availability sets (see below for a brief description). This is largely based on the Web Apps sales spiel, which states:
App Service provides availability and automatic scale on a global data centre infrastructure. Easily scale applications up or down on demand, and get high availability within and across different geographical regions.
Following on from David Ebbo's answer and comments, the underlying architecture of Web Apps appears to be that the VMs themselves are separated into availability sets. However, all of the instances use the same file server to share the underlying disk space, and this file server is a significant single point of failure.
To mitigate this, Azure have created the WEBSITE_LOCAL_CACHE_OPTION setting, which caches the contents of the file server on the individual Web App instances. That is using caching in lieu of solid, high-availability engineering principles.
The problem here is that as a customer we have no visibility into this issue, we've no idea if there is a plan to fix it, or if or when it will ever be fixed since it seems unlikely that Azure is going to issue a document that admits to how badly this has been engineered, even if it is to say that it is fixed.
I also can't imagine that this issue would be any different between ASM and ARM. It seems exceptionally unlikely that there was originally a high availability solution at the backend that they scrapped when ARM came along. So it is very likely that cloud services would suffer the exact same issue.
The small upside is that now that we know this is an issue, one possible solution would be to deploy multiple web apps and have a traffic manager between them. Even if they are in the same region, different apps should have different backend file servers.
My first action would be to reply to that email, with a link to the Web Apps page, (and this question) with a copy of the quote and ask how to enable high availability within a geographic region.
After that you'll likely need to rearchitect your solution!
Availability sets
For virtual machines Azure will let you specify an availability set. An availability set will automatically split VMs into separate update and fault domains. Meaning that servers will end up in different server racks, and those server racks won't get updates at the same time. (it is a little more complex than that, but that's the basics!)
Azure Web Apps do use shared file storage. The best way to think about it is that all the instances of your app map to the same network share that has your files. So if you modify the files by any means (e.g. FTP, msdeploy, git, ...), all the instances instantly get the new files (since there is only one set of files).
And to answer your final question, each instance does run on a separate VM.
