What does the Azure Web Apps architecture look like? - azure

I've had a few outages of 10 to 15 minutes, because apparently Microsoft had a 'blip' on their storages. They told me that it is because of a shared file system between the instances (making it a single point of failure?)
I didn't understand it and asked how file share is involved, because I would assume a really dumb stateless IIS app that communicates with SQL Azure for its data.
I would assume the situation below:
This is their reply to my question (I didn't include the drawing)
The file shares are not necessarily for your web app to communicate to
another resources but they are on our end where the app content
resides on. That is what we meant when we suggested that about storage
being unavailable on our file servers. The reason the restarts would
be triggered for your app that is on both the instances is because the
resources are shared, the underlying storage would be the same for
both the instances. That’s the reason if it goes down on one, the
other would also follow eventually. If you really want the
availability of the app to be improved, you can always use a traffic
manager. However, there is no guarantee that even with traffic manager
in place, the app doesn’t go down but it improves overall availability
of your app. Also we have recently rolled out an update to production
that should take care of restarts caused by storage blips ideally, but
for this feature to be kicked it you need to make sure that there is
ample amount of memory needs to be available in the cases where this
feature needs to kick in. We have couple of options that you can have
set up in order to avoid any unexpected restarts of the app because of
a storage blip on our end:
You can evaluate if you want to move to a bigger instance so that
we might have enough memory for the overlap recycling feature to be
kicked in.
If you don’t want to move to a bigger instance, you can always use
local cache feature as outlined by us in our earlier email.
Because of the time differences the communication takes ages. Can anyone tell me what is wrong in my thinking?
The only thing that I think of is that when you've enabled two instances, they run on the same physical server. But that makes really little sense to me.
I have two instances one core, 1.75 GB memory.

My presumption for App Service Plans was that they were automatically split into availability sets (see below for a brief description) Largely based on Web Apps sales spiel which states
App Service provides availability and automatic scale on a global data centre infrastructure. Easily scale applications up or down on demand, and get high availability within and across different geographical regions.
Following on from David Ebbo's answer and comments, the underlying architecture of Web apps appears to be that the VM's themselves are separated into availability sets. However all of the instances use the same fileserver to share the underlying disk space. This file server being a significant single point of failure.
To mitigate this Azure have created the WEBSITE_LOCAL_CACHE_OPTION which will cache the contents of the file server onto the individual Web App instances. Using caching in lieu of solid, high availability engineering principles.
The problem here is that as a customer we have no visibility into this issue, we've no idea if there is a plan to fix it, or if or when it will ever be fixed since it seems unlikely that Azure is going to issue a document that admits to how badly this has been engineered, even if it is to say that it is fixed.
I also can't imagine that this issue would be any different between ASM and ARM. It seems exceptionally unlikely that there was originally a high availability solution at the backend that they scrapped when ARM came along. So it is very likely that cloud services would suffer the exact same issue.
The small upside is that now that we know this is an issue, one possible solution would be to deploy multiple web apps and have a traffic manager between them. Even if they are in the same region, different apps should have different backend file servers.
My first action would be to reply to that email, with a link to the Web Apps page, (and this question) with a copy of the quote and ask how to enable high availability within a geographic region.
After that you'll likely need to rearchitect your solution!
Availability sets
For virtual machines Azure will let you specify an availability set. An availability set will automatically split VMs into separate update and fault domains. Meaning that servers will end up in different server racks, and those server racks won't get updates at the same time. (it is a little more complex than that, but that's the basics!)

Azure Web Apps do used a shared file storage. The best way to think about it is that all the instances of your app map to the same network share that have your files. So if you modify the files by any mean (e.g. FTP, msdeploy, git, ...), all the instances instantly get the new files (since there is only one set of files).
And to answer your final question, each instance does run on a separate VM.

Related

Horizontal/Vertical scaling of self hosted integration runtime

We're looking for automated way to horizontally, vertically scale the pull of self hosted integration runtime virtual machines used in ADF.
Reading Microsoft docs does not provide answer.
Well, I don't have the experience, so I can only give you a theoretical answer, but maybe it's helpfull for you.
AFAIK, neither way is configurable out-of-the-box. For scale-out you'll have to deploy an additional IR machine yourself. So probably you'll want to create an image that you can provision from docker or kubernetes and has the IR and pre-requirements installed. The IR installation provides an PowerShell script that can be used to create an automated connection.
For scale-up/down, you'll have to run some script that scales your vm. In an IaaS solution (f.e.) Azure VM, that should be doable with an API call to change your VM.
For both cases you'll have to have some kind of montitor in place that monitors the IR loads and makes changess as needed. I think the measures provided in the Data Factory should do. Maybe you can use Log Analyics to monitor the loads.
I'm curious about your use case for this.
My solution is just for scaling out/in since the VM must be restarted if you are scaling up/down, which causes downtime and job failures etc.
At a high level this solution requires just 3 simple things:
Azure Metric Alert that fires when Scale-Out should occur (VM Start)
Azure Metric Alert that fires when Scale-In should occur (VM Deallocation)
Logic App that is triggered by Azure Alert and actually executes the Start/Stop of the VM, along with any other automation associated with this (eg posting to a Teams channel when Scale in/out occurs)
Here are more of the details surrounding how we setup the conditions for the alerts, but the main thing to keep in mind is (IR CPU %, IR queue length, Number of Nodes, and possibly IR Memory)
Scale-Out
Scale-In
Actions for Alerts
As you can see below we have the alert triggering 1 Logic App, using the payload that is passed to the Logic App, you can determine if the Logic App should be starting the VM, or stopping the VM. (As well as any other additional actions)
Logic App
There is a small chance that due to timing (and depending on how many ADF's the IR is shared to), that pipeline activities could be sent to Node 2 at the same time a deallocation command is sent to the VM for Node 2. I have not seen this as of yet, but adjusting the alert conditions based on your need could help avoid this. Feel free to play around with the conditions of the alerts, granularity, thresholds, etc. This is not a one size fits all solution.

Confused about how Azure App Service Local Cache is useful, lacking any use cases

I've read all of the documentation about App Service Local Cache but I am struggling to see how it is useful. It claims to basically create a read-only copy of your Site directory, which for an MVC app is basically the whole app. But I can't find any information about use cases or why you'd want to do this.
I ask because it's been suggested that we move to implementing it, and I am trying to work out why we should do this.
I can see advantages if you do lots of reading/writing to disk but hardly any apps do that these days, we just use the database for everything, and logging goes directly to OMS.
Am I missing something major about this feature? To make my question non-vague, does this feature offer something useful for a simple MVC website that displays data from a database and writes back to the database?
Even if your app doesn't perform a large amount of I/O operations, you can still benefit from using App Service Local Cache due to:
Quicker app restarts (since files are local, latency to the shared network drive is removed). Helpful for app settings updates.
Less application downtime if your app loses connectivity with the shared network drive (which causes restarts), which can happen during Azure update/patch operations on the underlying VM
More are discussed in the Channel 9 video for Local Cache https://channel9.msdn.com/Shows/Cloud+Cover/Episode-201-Azure-Web-App-Local-Cache-with-Cory-Fowler

Windows Azure reliability (my server just lost its drives & sites, then 20 minutes later they reappeared)

I am on a Windows Azure trial to evaluate migrating a number of commercial ASP.NET sites to Azure from dedicated hosting. All was going OK ... until just now!
Some background - the sites are set up under Web Roles (i.e. as opposed to Web Sites) using SQL Azure and SQL Reporting. The site content was under the X: drive (there was also a B: drive that seemed to be mapped to the same location). There are several days left of the trial.
Without any apparent warning my test sites suddenly stopped working. Examining the server (through RDP) I saw that the B: and X: drives had disappeared (just C: D & E: I think were left), and in IIS the application pools and Sites had disappeared. In the Portal however, nothing seemed to have changed - the same services & config seemed to be there.
Then about 20 minutes later the missing drives, app pools and sites reappeared and my test sites started working again! However, the B: drive was gone and now there was an F: drive (showing the same as X:); also the MS ReportViewer 2008 control that I had installed earlier in the day was gone. It is almost as if the server had been replaced with another (but the IIS config was restored from the original).
As you can imagine, this makes me worried! If this is something that could happen in production there is no way I would consider hosting commercial sites for clients on Azure (unless there is some redundancy system available to keep a site up when such a failure occurs).
Can anyone explain what may have happened, if this is possible/predictable under a live subscription, and if so how to work around it?
One other thing to keep in mind is that an Azure Web Role is not persistent. I'm not sure how you installed the MS Report Viewer 2008 control but anything you add or install outside of a deployment package when you push your solution to Azure is not guaranteed to be available at some future point.
I admit that I don't fully understand the full picture when it comes to the overall architecture of Azure but I do know that Web Roles can and do re-create themselves from time to time. When the role recycles, it returns to the state as it was when it was installed. This is why Microsoft suggests using at least 2 instances of your role because while one or the other may recycle they will never recycle both at the same time, part of what guarantees the 99.9% uptime.
You might also want to consider an Azure VM. They are persistent but require you to maintain the server in terms of updates and software much in the way I suspect you are already doing with your dedicated hosting.
I've been hosting my solution in a large (4 core) web role, also using SQL Azure, for about two years and have had great success with it. I have roughly 3,000 users and rarely see the utilization of my web role go over 2% (meaning I've got a lot of room to grow). Overall it is a great hosting solution in my opinion.
According to the Azure SLA Microsoft guarantees up time of 99.9% or higher on all its products per billing month. (20 min on the month would be .0004% loss, not being critical, just suggesting that they are still within their SLA)
Current status shows that sql databases were having issues in the US north last night, but all services appear to be up currently
Personally, I have seen the dashboard go down, and report very weird problems, but the services that I programmed to worked just fine all the way through it. When I experienced this problem it was reported on the Azure Status, the platform status and the twitter feed
While I have seen bumps, they are few and far between, and I find reliability to be perceptibly higher than other providers that I have worked with.
As for workarounds I would suggest a standard mode for your websites and increasing instances of the site. You might try looking into the new add ins that are available with the latest Azure release. Active Cloud Monitoring by Metrichub might be what you require.
It sounds like you're expecting the web role to act as a Virtual Machine instance.
Web Roles aren't persistent (the machine can be destroyed and recreated at any time), so you should do any additional required set up as a 'startup task' in your Azure project (never install software manually).
Because of this you need at least 2 instances so that rolling upgrades (i.e. Windows security patches, hotfixes and so on) can be performed automatically without having your entire deployment taken offline.
If this doesn't suit your use case then you should look at Azure Virtual Machines, but you'll need to manage updates and so on yourself. It's usually better to use Web Roles properly as you can then do scaling and so on a lot more easily.

Create azure VM on my local machine

Is it possible to create one or several azure VMs on my local machine? I want to create a web app and load test it locally, without the need of putting it in the cloud. I'm thinking at the following scenario: I have a local VM running a IIS server with my web app; I use a tool to generate a lot of load; I need to deploy the second VM containing the same things as the first VM. The downtime of the web app should be equal to 0(hopefully).
Clarification(update):
I want to achieve the following: create a web app and a monitoring app(CPU,Memory) and deploy them on one VM. On a load test, if the VM cannot handle it(e.g. CPU goes above 80%), I want to programmatically deploy a new VM(with the same configuration, having both the web app and the monitoring app), such that no downtime occurs.
Azure has several ways for you to host sites.
Virtual Machines is just that, normal VMs. You can create them locally and upload them, but everything is up to you, including how to handle upgrades. If that is what you need to do then I don't know how you would handle upgrades with no down time; though, you can add multiple VMs to a load balancer and then upgrade them one at a time.
It sounds like what you really want to explore is Cloud Services. You can run one or more VMs locally in the emulator, upgrade with no down time once in the cloud, implement auto scaling (you will have to use a tool or write some code).
Alternatively you may want to look at Azure Web sites, but that is a completely different concept and you can't really test load and load balancing locally the same way.
Based on your statement that you essentially want to auto-scale your application you want to look at Cloud Services with Auto Scaling. However, you can't fully test this in the cloud emulator - but you can test your logic.
Background
Azure Cloud Services is designed for this kind of thing; You don't really work with VMs in the way you may be used to, instead you create a package that Azure then deploys to as many servers as you like. Once up and running, you can manually go into the management console and increase or decrease the number of active servers simply by moving a slider. Of course, you want to do this automatically, so you have a few options.
There is a management API you can use to change the number of servers. So, it would be quite simple to write a bit of code that you spin up in another thread from WebRole.Start and that simply sits and monitors the CPU on the machine and then calls the management API to spin up a new server instance if your CPU goes over a certain treshold. Okay, locally you can only test that the call to the management API is made, you won't actually see the new server coming up. But, if you grab your free trial of Azure and just try it you will see that you really don't need to test that part - it just works.
However, in practice there is an awful lot more to auto scaling. Here are some of the things you need to consider;
Even relatively idle web servers will often spike briefly to 100% so just having a simple treshold is unlikely to be good enough; You need to decide on how long the server needs to be over a certain treshold before you spin up another server instance.
What happens when you have more than one server? And, on Azure, you should always have at least two servers to ensure you have resilience. Note that the idea with Cloud Services really is to have many small servers rather than a few big servers. You pay per core, not per number of servers.
Imagine you currently have three servers and one is really busy for some reason and the other two are idle. Do you want to spin up a fourth server?
Imagine you currently have two servers and they are both quite busy. Do you really want them both to start a new server so you end up with four servers running?
There are several ways to handle these challenges. For starters, rather than having monitor programs running locally on each server, you are better of moving that monitoring outside; Azure comes with the ability to dump performance metrics to table storage at whatever interval you choose. You can then run an external program that retrieves the performance data over time from all your current servers and then reason about the overall workload before deciding to spin up or shut down additional servers. Now, you can of course host that external monitor program in a separate thread on each of your webroles to give your monitoring resilience - but the key point is that the monitoring program doesn't monitor the server it runs on, it monitors all the servers. You will, of course, still have to deal with stopping multiple monitoring program instances from all starting and stopping servers. One way to do is to place stop/start commands onto an Azure "message queue" (there are a few different types) and use the built-in "de-duper" which will automatically delete identical commands that are put on the queue within a certain time window (I am over simplyfing but you get the idea).
The actual answer
Really, though, you want to look at the Auto Scaling Application Block which will do most of this for you. I guess that is the real answer to your question, but I wanted to provide a bit of context first.
Again, I recognise you asked for how to test this locally - but I believe that that question doesn't really make sense in the context of Azure and I hope the above information helps.
I'm pretty sure you can't do that and it wouldn't make sense anyway. If you want load testing, you need to run that in an environment as similar to production as possible and that means you have to run your application is Azure cloud. How else do you know that the load will actually be processed fine on real cloud?

Architecture recommendation for load-balanced ASP.NET site

UPDATE 2009-05-21
I've been testing the #2 method of using a single network share. It is resulting in some issues with Windows Server 2003 under load:
http://support.microsoft.com/kb/810886
end update
I've received a proposal for an ASP.NET website that works as follows:
Hardware load-balancer -> 4 IIS6 web servers -> SQL Server DB with failover cluster
Here's the problem...
We are choosing where to store the web files (aspx, html, css, images). Two options have been proposed:
1) Create identical copies of the web files on each of the 4 IIS servers.
2) Put a single copy of the web files on a network share accessible by the 4 web servers. The webroots on the 4 IIS servers will be mapped to the single network share.
Which is the better solution?
Option 2 obviously is simpler for deployments since it requires copying files to only a single location. However, I wonder if there will be scalability issues since four web servers are all accessing a single set of files. Will IIS cache these files locally? Would it hit the network share on every client request?
Also, will access to a network share always be slower than getting a file on a local hard drive?
Does the load on the network share become substantially worse if more IIS servers are added?
To give perspective, this is for a web site that currently receives ~20 million hits per month. At recent peak, it was receiving about 200 hits per second.
Please let me know if you have particular experience with such a setup. Thanks for the input.
UPDATE 2009-03-05
To clarify my situation - the "deployments" in this system are far more frequent than a typical web application. The web site is the front end for a back office CMS. Each time content is published in the CMS, new pages (aspx, html, etc) are automatically pushed to the live site. The deployments are basically "on demand". Theoretically, this push could happen several times within a minute or more. So I'm not sure it would be practical to deploy one web server at time. Thoughts?
I'd share the load between the 4 servers. It's not that many.
You don't want that single point of contention either when deploying nor that single point of failure in production.
When deploying, you can do them 1 at a time. Your deployment tools should automate this by notifying the load balancer that the server shouldn't be used, deploying the code, any pre-compilation work needed, and finally notifying the load balancer that the server is ready.
We used this strategy in a 200+ web server farm and it worked nicely for deploying without service interruption.
If your main concern is performance, which I assume it is since you're spending all this money on hardware, then it doesn't really make sense to share a network filesystem just for convenience sake. Even if the network drives are extremely high performing, they won't perform as well as native drives.
Deploying your web assets are automated anyway (right?) so doing it in multiples isn't really much of an inconvenience.
If it is more complicated than you're letting on, then maybe something like DeltaCopy would be useful to keep those disks in sync.
One reason the central share is bad is because it makes the NIC on the share server the bottleneck for the whole farm and creates a single point of failure.
With IIS6 and 7, the scenario of using a network single share across N attached web/app server machines is explicitly supported. MS did a ton of perf testing to make sure this scenario works well. Yes, caching is used. With a dual-NIC server, one for the public internet and one for the private network, you'll get really good performance. The deployment is bulletproof.
It's worth taking the time to benchmark it.
You can also evaluate a ASP.NET Virtual Path Provider, which would allow you to deploy a single ZIP file for the entire app. Or, with a CMS, you could serve content right out of a content database, rather than a filesystem. This presents some really nice options for versioning.
VPP For ZIP via #ZipLib.
VPP for ZIP via DotNetZip.
In an ideal high-availability situation, there should be no single point of failure.
That means a single box with the web pages on it is a no-no. Having done HA work for a major Telco, I would initially propose the following:
Each of the four servers has it's own copy of the data.
At a quiet time, bring two of the servers off-line (i.e., modify the HA balancer to remove them).
Update the two off-line servers.
Modify the HA balancer to start using the two new servers and not the two old servers.
Test that to ensure correctness.
Update the two other servers then bring them online.
That's how you can do it without extra hardware. In the anal-retentive world of the Telco I worked for, here's what we would have done:
We would have had eight servers (at the time, we had more money than you could poke a stick at). When the time came for transition, the four offline servers would be set up with the new data.
Then the HA balancer would be modified to use the four new servers and stop using the old servers. This made switchover (and, more importantly, switchback if we stuffed up) a very fast and painless process.
Only when the new servers had been running for a while would we consider the next switchover. Up until that point, the four old servers were kept off-line but ready, just in case.
To get the same effect with less financial outlay, you could have extra disks rather than whole extra servers. Recovery wouldn't be quite as quick since you'd have to power down a server to put the old disk back in, but it would still be faster than a restore operation.
Use a deployment tool, with a process that deploys one at a time and the rest of the system keeps working (as Mufaka said). This is a tried process that will work with both content files and any compiled piece of the application (which deploy causes a recycle of the asp.net process).
Regarding the rate of updates this is something you can control. Have the updates go through a queue, and have a single deployment process that controls when to deploy each item. Notice this doesn't mean you process each update separately, as you can grab the current updates in the queue and deploy them together. Further updates will arrive to the queue, and will be picked up once the current set of updates is over.
Update: About the questions in the comment. This is a custom solution based on my experience with heavy/long processes which needs their rate of updates controlled. I haven't had the need to use this approach for deployment scenarios, as for such dynamic content I usually go with a combination of DB and cache at different levels.
The queue doesn't need to hold the full information, it just need to have the appropriate info (ids/paths) that will let your process pass the info to start the publishing process with an external tool. As it is custom code, you can have it join the information to be published, so you don't have to deal with that in the publishing process/tool.
The DB changes would be done during the publishing process, again you just need to know where the info for the required changes is and let the publishing process/tool handle it. Regarding what to use for the queue, the main ones I have used is msmq and a custom implementation with info in sql server. The queue is just there to control the rate of the updates, so you don't need anything specially targeted at deployments.
Update 2: make sure your DB changes are backwards compatible. This is really important, when you are pushing changes live to different servers.
I was in charge of development for a game website that had 60 million hits a month. The way we did it was option #1. User did have the ability to upload images and such and those were put on a NAS that was shared between the servers. It worked out pretty well. I'm assuming that you are also doing page caching and so on, on the application side of the house. I would also deploy on demand, the new pages to all servers simultaneously.
What you gain on NLB with the 4IIS you loose it with the BottleNeck with the app server.
For scalability I'll recommend the applications on the front end web servers.
Here in my company we are implementing that solution. The .NET app in the front ends and an APP server for Sharepoint + a SQL 2008 Cluster.
Hope it helps!
regards!
We have a similar situation to you and our solution is to use a publisher/subscriber model. Our CMS app stores the actual files in a database and notifies a publishing service when a file has been created or updated. This publisher then notifies all the subscribing web applications and they then go and get the file from the database and place it on their file systems.
We have the subscribers set in a config file on the publisher but you could go the whole hog and have the web app do the subscription itself on app startup to make it even easier to manage.
You could use a UNC for the storage, we chose a DB for convenience and portability between or production and test environments (we simply copy the DB back and we have all the live site files as well as the data).
A very simple method of deploying to multiple servers (once the nodes are set up correctly) is to use robocopy.
Preferably you'd have a small staging server for testing and then you'd 'robocopy' to all deployment servers (instead of using a network share).
robocopy is included in the MS ResourceKit - use it with the /MIR switch.
To give you some food for thought you could look at something like Microsoft's Live Mesh
. I'm not saying it's the answer for you but the storage model it uses may be.
With the Mesh you download a small Windows Service onto each Windows machine you want in your Mesh and then nominate folders on your system that are part of the mesh. When you copy a file into a Live Mesh folder - which is the exact same operation as copying to any other foler on your system - the service takes care of syncing that file to all your other participating devices.
As an example I keep all my code source files in a Mesh folder and have them synced between work and home. I don't have to do anything at all to keep them in sync the action of saving a file in VS.Net, notepad or any other app initiates the update.
If you have a web site with frequently changing files that need to go to multiple servers, and presumably mutliple authors for those changes, then you could put the Mesh service on each web server and as authors added, changed or removed files the updates would be pushed automatically. As far as the authors go they would just be saving their files to a normal old folder on their computer.
Assuming your IIS servers are running Windows Server 2003 R2 or better, definitely look into DFS Replication. Each server has it's own copy of the files which eliminates a shared network bottleneck like many others have warned against. Deployment is as simple as copying your changes to any one of the servers in the replication group (assuming a full mesh topology). Replication takes care of the rest automatically including using remote differential compression to only send the deltas of files that have changed.
We're pretty happy using 4 web servers each with a local copy of the pages and a SQL Server with a fail over cluster.

Resources