We have multiple Terraform scripts that create/update hundreds of resources on Azure. If we want to change anything on API Management related resources, it takes ages and regularly even times out. Running it again sometimes resolves the issue, but sometimes it tells us that the API we want to create already exists, and similar errors.
The customer is getting really annoyed with us for providing unreliable update scripts that create significant effort for the operations team responsible for deploying and running the whole product. Saving changes in API Management also takes ages and runs into errors when we use the Azure portal.
Is there any trick or clue on how to improve our situation?
(This has been going on for a while now and feels like it's getting worse over time.)
I'd start by using the debugging options to work out precisely which resources are taking the longest. You could consider breaking those out into a separate state so you don't have to plan them on every run.
Next, ensure that the process running Terraform has timeouts set greater than Terraform's own. Killing Terraform mid-run is a good way to end up with a confused state.
Aside from that, some resources let you provide operation timeouts. Where available, these ensure Terraform treats the resource as failed before the process running Terraform kills it.
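For illustration only (the resource type, name and durations below are placeholders, and not every azurerm resource supports every operation), a timeouts block in your configuration would look roughly like this:

# Rough sketch: raise the operation timeouts on a slow API Management resource
# so Terraform waits longer before giving up on the operation.
resource "azurerm_api_management_api" "example" {
  # ... existing arguments unchanged ...

  timeouts {
    create = "3h"
    update = "3h"
    delete = "1h"
  }
}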
I'd consider opening a bug on the azurerm provider or asking in the Terraform Section of the Community Forum.
Azure API Management is slow in applying changes because it's a distributed service. An update operation will take time as it waits until the changes are applied to all instances. If you are facing similar issues in the portal it's a sign that it has nothing to do with Terraform or AzureRM. I would contact Azure support, as they will have the telemetry to help you further.
In my personal experience, a guaranteed way to get things stuck is to make a lot of small changes in succession without waiting for the previous ones to finish, so I would start by checking for that.
Finally, if the previous steps don't help, I would try using Bicep/ARM to manage the APIM. Usually, the ARM deployment API is a bit more robust than the APIs used by Terraform and the Go SDK.
I have 6 Web Apps (ASP.NET, Windows) running on Azure, and they have been running for years. I tweak them from time to time, but make no major changes.
About a week ago, all of them started to leak handles, as shown in the image: this is just the last 30 days, but the flat curve goes back "forever". Now, while I did make some minor changes to some of the sites, there are at least 3 sites that I did not touch at all.
But still, major leakage started on all sites a week ago. Any ideas what could be causing this?
I would like to add that one of the sites has only a single .aspx page, and another site does not have any code at all. It's just there to run a WebJob containing the Let's Encrypt script. That hasn't changed for several months.
So basically, I'm looking for any pointers, but I doubt this has anything to do with my code, given that 2 of the sites do not contain any of my code and still show the same symptom.
Final information from the product team:
The Microsoft Azure Team has investigated the issue you experienced, which resulted in an increased number of handles in your application. An excessive number of handles can potentially contribute to application slowness and crashes.
Upon investigation, engineers discovered that a recent upgrade of Azure App Service with improvements for monitoring of the platform resulted in a leak of registry key handles in application worker processes. The registry key handle in question is not properly closed by a module which is owned by the platform and is injected into every Web App. This module ensures various basic functionalities and features of Azure App Service, such as correct processing of HTTP headers, remote debugging (if enabled and applicable), correct return of responses to clients through load balancers, and others. This module was recently improved to include additional information passed around within the infrastructure (not leaving the boundary of Azure App Service, so this information is not visible to customers). This information includes the versions of the modules which processed every request, so internal detection of issues caused by component version changes can be easier and faster. The issue is caused by not closing a specific registry key handle while reading the version information from the machine's registry.
As a workaround/mitigation, in case customers see any issues (such as increased application latency), it is advised to restart the web app, which resets all handles and instantly cleans up all leaked handles in memory.
Engineers prepared a fix which will be rolled out in the next regularly scheduled upgrade of the platform. There is also a parallel rollout of a temporary fix which should finish by 12/23. Any apps restarted after this temporary fix is rolled out shouldn’t observe the issue anymore as the restarted processes will automatically pick up a new version of the module in question.
We are continuously taking steps to improve the Azure Web App service and our processes to ensure such incidents do not occur in the future, and in this case it includes (but is not limited to):
• Fixing the registry key handle leak in the platform module
• Fixing the gap in test coverage and monitoring to ensure that such regressions will not happen again in the future and will be automatically detected before they are rolled out to customers
So it appears this is a problem with Azure. Here is the relevant part of the current response from Azure technical support:
==>
We discussed this with the PG team directly and observed that a few other customers are also facing this issue, so our product team is actively working to resolve it as soon as possible. There is a good chance that the fixes will be available within a few days, unless something unexpected prevents us from completing the patch.
<==
Will add more info as it becomes available.
I've read all of the documentation about App Service Local Cache but I am struggling to see how it is useful. It claims to basically create a read-only copy of your Site directory, which for an MVC app is basically the whole app. But I can't find any information about use cases or why you'd want to do this.
I ask because it's been suggested that we move to implementing it, and I am trying to work out why we should do this.
I can see advantages if you do lots of reading/writing to disk, but hardly any apps do that these days; we just use the database for everything, and logging goes directly to OMS.
Am I missing something major about this feature? To make my question non-vague, does this feature offer something useful for a simple MVC website that displays data from a database and writes back to the database?
Even if your app doesn't perform a large amount of I/O, you can still benefit from using App Service Local Cache due to:
• Quicker app restarts (since files are local, latency to the shared network drive is removed). Helpful for app settings updates.
• Less application downtime if your app loses connectivity with the shared network drive (which causes restarts), which can happen during Azure update/patch operations on the underlying VM.
More are discussed in the Channel 9 video for Local Cache https://channel9.msdn.com/Shows/Cloud+Cover/Episode-201-Azure-Web-App-Local-Cache-with-Cory-Fowler
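If you do decide to try it, Local Cache is switched on through app settings; the setting names below are the documented ones, but treat the size value as an example and check the platform's current limits:

WEBSITE_LOCAL_CACHE_OPTION = Always
WEBSITE_LOCAL_CACHE_SIZEINMB = 1000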
I've successfully deployed Umbraco 4.7 to a Windows Azure Web Site with SQL Azure, but sometimes I get errors similar to this one:
SQL helper exception in ExecuteScalar ---> System.Data.SqlClient.SqlException: A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 0 - An existing connection was forcibly closed by the remote host.) --->
It seems that Umbraco does not implement retry logic (SQL Azure transient fault handling). Does anybody know of any non-traumatic way to implement this on the Umbraco side?
Amhed, there is no easy way to do this. People using Umbraco in Azure have generally run into session state problems that cause connection issues, but as you stated you are only catching these about 1% of the time, which means you are picking up transient errors. You will probably have seen the discussion here:
http://our.umbraco.org/forum/core/general/27179-SQL-Azure-connectivity-issues?p=1
The net effect is that you have to take the Umbraco code base and rebuild it with a retry framework in it. That comes with serious overhead for using Umbraco and taking future releases. You would be best off lobbying the Umbraco core team to put a full retry framework in place and support it. (Let's not even talk about security issues ;)
This is probably not what you want to hear, but effectively it's roll-your-own-data-layer time.
Having said all that, I went and had a look ;) (because this does interest me for something else).
As a starting point, from looking at the Umbraco source, they are using Microsoft.ApplicationBlocks.Data as the base for making connections and executing SQL commands. Looking at umbraco.DataLayer.SqlHelpers.SqlServer.SqlServerHelper:
using SH = Microsoft.ApplicationBlocks.Data.SqlHelper;
So my guess is that you would need to replace this block. A quick (and dirty) search on the internet turns up:
http://www.sharpdeveloper.net/source/SqlHelper-Source-Code-cs.html
You could reflect out the class though to make sure.
You could then rebuild this with a transient fault handling framework and potentially drop it in as a simple DLL swap (famous last words).
But you could at least get
using (ReliableSqlConnection conn = new ReliableSqlConnection(connString, retryPolicy))
going easily in that type of class, plus more lovely stuff:
http://msdn.microsoft.com/en-us/library/hh680899(v=pandp.50).aspx
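To make this concrete, here is a rough, untested sketch of the shape such a helper could take; the exact namespaces, the error-detection strategy class and helper method names vary between versions of the block, so treat them as placeholders rather than a definitive implementation:

// Rough sketch only: wrap the connection in ReliableSqlConnection with a retry
// policy so transient SQL Azure failures are retried instead of bubbling up.
using System;
using System.Data.SqlClient;
using Microsoft.Practices.TransientFaultHandling; // namespace varies by block version

public static class RetryingSqlHelper
{
    // Retry up to 5 times, starting at 1 second and adding 2 seconds per attempt.
    private static readonly RetryPolicy Policy =
        new RetryPolicy<SqlAzureTransientErrorDetectionStrategy>(
            new Incremental(5, TimeSpan.FromSeconds(1), TimeSpan.FromSeconds(2)));

    public static int ExecuteScalarWithRetry(string connectionString, string commandText)
    {
        using (var conn = new ReliableSqlConnection(connectionString, Policy))
        using (var cmd = new SqlCommand(commandText))
        {
            conn.Open();
            // ExecuteCommand<T> runs the command on the reliable connection,
            // retrying on transient failures according to the policy above.
            return conn.ExecuteCommand<int>(cmd);
        }
    }
}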
If I were going to do this, this is where I would start, and I would go in at that level.
I am not sure whether that would cover 100% of the connections, as I don't actively work in the Umbraco code base, so this is a best guess; but from looking at the source, this is where I would start and swap things out, and it looks to be your best starting point.
HTH,
James
UPDATE 2009-05-21
I've been testing the #2 method of using a single network share. It is resulting in some issues with Windows Server 2003 under load:
http://support.microsoft.com/kb/810886
end update
I've received a proposal for an ASP.NET website that works as follows:
Hardware load-balancer -> 4 IIS6 web servers -> SQL Server DB with failover cluster
Here's the problem...
We are choosing where to store the web files (aspx, html, css, images). Two options have been proposed:
1) Create identical copies of the web files on each of the 4 IIS servers.
2) Put a single copy of the web files on a network share accessible by the 4 web servers. The webroots on the 4 IIS servers will be mapped to the single network share.
Which is the better solution?
Option 2 obviously is simpler for deployments since it requires copying files to only a single location. However, I wonder if there will be scalability issues since four web servers are all accessing a single set of files. Will IIS cache these files locally? Would it hit the network share on every client request?
Also, will access to a network share always be slower than getting a file on a local hard drive?
Does the load on the network share become substantially worse if more IIS servers are added?
To give perspective, this is for a web site that currently receives ~20 million hits per month. At recent peak, it was receiving about 200 hits per second.
Please let me know if you have particular experience with such a setup. Thanks for the input.
UPDATE 2009-03-05
To clarify my situation - the "deployments" in this system are far more frequent than for a typical web application. The web site is the front end for a back office CMS. Each time content is published in the CMS, new pages (aspx, html, etc.) are automatically pushed to the live site. The deployments are basically "on demand". Theoretically, this push could happen several times within a minute or more. So I'm not sure it would be practical to deploy one web server at a time. Thoughts?
I'd share the load between the 4 servers. It's not that many.
You don't want that single point of contention when deploying, nor that single point of failure in production.
When deploying, you can do the servers one at a time. Your deployment tools should automate this by notifying the load balancer that the server shouldn't be used, deploying the code, doing any pre-compilation work needed, and finally notifying the load balancer that the server is ready.
We used this strategy in a 200+ web server farm and it worked nicely for deploying without service interruption.
If your main concern is performance, which I assume it is since you're spending all this money on hardware, then it doesn't really make sense to share a network filesystem just for convenience's sake. Even if the network drives are extremely high-performing, they won't perform as well as local drives.
Deploying your web assets is automated anyway (right?), so doing it to multiple servers isn't really much of an inconvenience.
If it is more complicated than you're letting on, then maybe something like DeltaCopy would be useful to keep those disks in sync.
One reason the central share is bad is because it makes the NIC on the share server the bottleneck for the whole farm and creates a single point of failure.
With IIS6 and 7, the scenario of using a single network share across N attached web/app server machines is explicitly supported. MS did a ton of perf testing to make sure this scenario works well. Yes, caching is used. With a dual-NIC server, one for the public internet and one for the private network, you'll get really good performance. The deployment is bulletproof.
It's worth taking the time to benchmark it.
You can also evaluate an ASP.NET Virtual Path Provider, which would allow you to deploy a single ZIP file for the entire app. Or, with a CMS, you could serve content right out of a content database rather than a filesystem. This presents some really nice options for versioning; a rough sketch of the provider shape follows the links below.
VPP for ZIP via #ZipLib.
VPP for ZIP via DotNetZip.
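To give a feel for the moving parts (a rough sketch with made-up names, backed by an in-memory dictionary rather than a ZIP so it stays short), a virtual path provider boils down to a couple of overrides plus a registration call at startup:

// Rough sketch: a VirtualPathProvider that serves content from an in-memory map;
// a ZIP-backed provider would populate the map from the archive instead.
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Web;
using System.Web.Hosting;

public class DictionaryVirtualFile : VirtualFile
{
    private readonly byte[] _content;

    public DictionaryVirtualFile(string virtualPath, byte[] content)
        : base(virtualPath)
    {
        _content = content;
    }

    public override Stream Open()
    {
        return new MemoryStream(_content);
    }
}

public class DictionaryPathProvider : VirtualPathProvider
{
    private readonly IDictionary<string, byte[]> _files;

    public DictionaryPathProvider(IDictionary<string, byte[]> files)
    {
        _files = files;
    }

    public override bool FileExists(string virtualPath)
    {
        return _files.ContainsKey(VirtualPathUtility.ToAppRelative(virtualPath))
            || Previous.FileExists(virtualPath);
    }

    public override VirtualFile GetFile(string virtualPath)
    {
        byte[] content;
        return _files.TryGetValue(VirtualPathUtility.ToAppRelative(virtualPath), out content)
            ? new DictionaryVirtualFile(virtualPath, content)
            : Previous.GetFile(virtualPath);
    }
}

// Registered once at startup, e.g. in Application_Start:
// HostingEnvironment.RegisterVirtualPathProvider(
//     new DictionaryPathProvider(new Dictionary<string, byte[]>
//     {
//         { "~/hello.html", Encoding.UTF8.GetBytes("<h1>Hello</h1>") }
//     }));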
In an ideal high-availability situation, there should be no single point of failure.
That means a single box with the web pages on it is a no-no. Having done HA work for a major Telco, I would initially propose the following:
Each of the four servers has its own copy of the data.
At a quiet time, bring two of the servers off-line (i.e., modify the HA balancer to remove them).
Update the two off-line servers.
Modify the HA balancer to start using the two new servers and not the two old servers.
Test that to ensure correctness.
Update the two other servers then bring them online.
That's how you can do it without extra hardware. In the anal-retentive world of the Telco I worked for, here's what we would have done:
We would have had eight servers (at the time, we had more money than you could poke a stick at). When the time came for transition, the four offline servers would be set up with the new data.
Then the HA balancer would be modified to use the four new servers and stop using the old servers. This made switchover (and, more importantly, switchback if we stuffed up) a very fast and painless process.
Only when the new servers had been running for a while would we consider the next switchover. Up until that point, the four old servers were kept off-line but ready, just in case.
To get the same effect with less financial outlay, you could have extra disks rather than whole extra servers. Recovery wouldn't be quite as quick since you'd have to power down a server to put the old disk back in, but it would still be faster than a restore operation.
Use a deployment tool, with a process that deploys to one server at a time while the rest of the system keeps working (as Mufaka said). This is a tried process that will work with both content files and any compiled piece of the application (whose deployment causes a recycle of the ASP.NET process).
Regarding the rate of updates, this is something you can control. Have the updates go through a queue, and have a single deployment process that controls when to deploy each item. Note that this doesn't mean you process each update separately; you can grab the current updates in the queue and deploy them together. Further updates will arrive in the queue and will be picked up once the current set of updates is done.
Update: About the questions in the comment. This is a custom solution based on my experience with heavy/long-running processes which need their rate of updates controlled. I haven't had the need to use this approach for deployment scenarios, as for such dynamic content I usually go with a combination of DB and cache at different levels.
The queue doesn't need to hold the full information; it just needs the appropriate info (IDs/paths) that lets your process kick off the publishing with an external tool. As it is custom code, you can have it join together the information to be published, so you don't have to deal with that in the publishing process/tool.
The DB changes would be done during the publishing process; again, you just need to know where the info for the required changes is and let the publishing process/tool handle it. Regarding what to use for the queue, the main ones I have used are MSMQ and a custom implementation with the info in SQL Server. The queue is just there to control the rate of the updates, so you don't need anything specifically targeted at deployments.
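As a very rough illustration of the MSMQ variant (the queue path, message format and the deploy callback are all made up for this sketch), the worker just drains whatever has accumulated and publishes it as one batch:

// Rough sketch: drain all pending update messages and deploy them as one batch.
using System;
using System.Collections.Generic;
using System.Messaging;

public class PublishWorker
{
    private readonly MessageQueue _queue;
    private readonly Action<List<string>> _deploy;

    public PublishWorker(string queuePath, Action<List<string>> deploy)
    {
        _queue = new MessageQueue(queuePath)
        {
            Formatter = new XmlMessageFormatter(new[] { typeof(string) })
        };
        _deploy = deploy;
    }

    public void Run()
    {
        while (true)
        {
            var batch = new List<string>();

            // Block until at least one update arrives, then drain whatever else is queued.
            batch.Add((string)_queue.Receive().Body);
            while (HasPendingMessage())
            {
                batch.Add((string)_queue.Receive().Body);
            }

            // The callback does the actual publishing, e.g. copying the listed files to every server.
            _deploy(batch);
        }
    }

    private bool HasPendingMessage()
    {
        try
        {
            _queue.Peek(TimeSpan.Zero);
            return true;
        }
        catch (MessageQueueException)
        {
            return false;
        }
    }
}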
Update 2: Make sure your DB changes are backwards compatible. This is really important when you are pushing changes live to different servers.
I was in charge of development for a game website that had 60 million hits a month. The way we did it was option #1. Users did have the ability to upload images and such, and those were put on a NAS that was shared between the servers. It worked out pretty well. I'm assuming that you are also doing page caching and so on, on the application side of the house. I would also deploy the new pages on demand to all servers simultaneously.
What you gain from NLB across the 4 IIS servers, you lose to the bottleneck at the app server.
For scalability, I recommend keeping the application files on the front-end web servers.
Here in my company we are implementing that solution: the .NET app on the front ends, and an app server for SharePoint plus a SQL 2008 cluster.
Hope it helps!
Regards!
We have a similar situation to you and our solution is to use a publisher/subscriber model. Our CMS app stores the actual files in a database and notifies a publishing service when a file has been created or updated. This publisher then notifies all the subscribing web applications and they then go and get the file from the database and place it on their file systems.
We have the subscribers set in a config file on the publisher but you could go the whole hog and have the web app do the subscription itself on app startup to make it even easier to manage.
You could use a UNC share for the storage; we chose a DB for convenience and portability between our production and test environments (we simply copy the DB back and we have all the live site files as well as the data).
A very simple method of deploying to multiple servers (once the nodes are set up correctly) is to use robocopy.
Preferably you'd have a small staging server for testing and then you'd 'robocopy' to all deployment servers (instead of using a network share).
Robocopy is included in the MS Resource Kit - use it with the /MIR switch.
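For example (server names and paths are placeholders), a deployment step could mirror the tested staging copy out to each node:

robocopy D:\Staging\wwwroot \\WEB01\wwwroot$ /MIR /Z /R:3 /W:5
robocopy D:\Staging\wwwroot \\WEB02\wwwroot$ /MIR /Z /R:3 /W:5

/MIR mirrors the tree (and deletes files on the destination that no longer exist on the source), /Z uses restartable copies, and /R and /W control the retry count and the wait between retries.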
To give you some food for thought, you could look at something like Microsoft's Live Mesh. I'm not saying it's the answer for you, but the storage model it uses may be.
With the Mesh, you download a small Windows service onto each Windows machine you want in your Mesh and then nominate folders on your system that are part of the Mesh. When you copy a file into a Live Mesh folder - which is the exact same operation as copying to any other folder on your system - the service takes care of syncing that file to all your other participating devices.
As an example, I keep all my code source files in a Mesh folder and have them synced between work and home. I don't have to do anything at all to keep them in sync; the act of saving a file in VS.NET, Notepad, or any other app initiates the update.
If you have a web site with frequently changing files that need to go to multiple servers, and presumably multiple authors for those changes, then you could put the Mesh service on each web server, and as authors added, changed or removed files, the updates would be pushed automatically. As far as the authors go, they would just be saving their files to a normal old folder on their computer.
Assuming your IIS servers are running Windows Server 2003 R2 or better, definitely look into DFS Replication. Each server has its own copy of the files, which eliminates the shared network bottleneck that many others have warned against. Deployment is as simple as copying your changes to any one of the servers in the replication group (assuming a full mesh topology). Replication takes care of the rest automatically, including using remote differential compression to send only the deltas of files that have changed.
We're pretty happy using 4 web servers each with a local copy of the pages and a SQL Server with a fail over cluster.