I'm running a single (not scaled out) Solr instance on an Azure App Service.
The App Service runs Java 8 and a Jetty 9.3 container.
Everything works really well, but when Azure decides to swap to another VM, sometimes the JVM doesn't seem to shut down gracefully and we encounter issues.
One of the reasons for Azure to swap to another VM is infrastructure maintenance. For example, Windows updates are installed and your app is moved to another machine.
To prevent downtime, Azure spins up the new app and, when it's ready, swaps over to it. Seems fine, but this does not seem to work well with Solr's locking mechanism.
We are using the default native lockType, which should be fine since we're only running a single instance. Solr should remove the write.lock file during shutdown, but this does not seem to happen every time.
The Azure Diagnostics tools clearly show this event happening, and the memory usage graph shows both apps running at the same time.
While the second instance starts, Solr tries to lock the index, but this is not possible because the first instance is still using it (and it still holds the write.lock file). Sometimes the first instance doesn't remove the write.lock file, and this is where the problems start.
The second Solr instance will never work correctly without manual intervention (manually deleting the write.lock file).
The Solr logs:
Caused by: org.apache.solr.common.SolrException: Index dir 'D:\home\site\wwwroot\server\solr\****\data\index/' of core '*****' is already locked. The most likely cause is another Solr server (or another solr core in this server) also configured to use this directory; other possible causes may be specific to lockType: native
and
org.apache.lucene.store.LockObtainFailedException
What can be done about this?
I was thinking of changing the lockType to a memory-based lock, but I'm not sure whether that would work, because both instances are alive at the same time for a short period.
You could try setting WEBSITE_DISABLE_OVERLAPPED_RECYCLING=1.
Overlapped recycling makes it so that before the current instance on
an app is shut down, a new instance starts. It can in some cases cause
file locking issues, in which case you can try turning it off:
Reference
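If you prefer to script the setting rather than use the portal, something along these lines should work with the current Azure CLI (the app and resource group names here are just placeholders):

    az webapp config appsettings set --name my-solr-app --resource-group my-rg --settings WEBSITE_DISABLE_OVERLAPPED_RECYCLING=1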
If you would like to run Solr without any locking at all, you can specify <lockType>none</lockType> in your solrconfig.xml instead of the usual <lockType>native</lockType>.
Obviously, you need to be careful with this, since different Solr instances could try to change the index at the same time, which could lead to index corruption.
All available lock types are listed there
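As a rough sketch, the relevant part of solrconfig.xml would then look something like this (the rest of the indexConfig block stays whatever your core already uses):

    <indexConfig>
      <!-- "none" disables index locking entirely; "native" is the default -->
      <lockType>none</lockType>
    </indexConfig>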
Related
Looking at my Pingdom reports, I have noticed that my website instance is getting recycled (Pingdom is basically used to keep my site warm). When I look deeper into the Azure logs, i.e. /LogFiles/kudu/trace, I notice a number of small XML files with "shutdown" or "startup" suffixes, e.g.:
2015-07-29T20-05-05_abc123_002_Shutdown_0s.xml
While I suspect this might be to do with MS patching VMs, I am not sure. My application is not showing any raised exceptions, hence my suspicion that it is happening at the OS level. Is there a way to find out why my instance is being shut down?
I should also mention that I am using one S2 instance, scalable to three depending on CPU usage. We may have to review this and use a 2-3 setup, but obviously this doubles the cost.
EDIT
I have looked at my Operation Logs and all I see is "UpdateWebsite" with a status of "succeeded"; there is nothing for the times when I saw the above files. So it seems that the instance is being shut down, but the event is not appearing in the Operation Log. Why would this be? I had about 5 yesterday, yet the last Operation Log entry was 29/7.
An example of one of yesterday's shutdown XML files:
2015-08-05T13-26-18_abc123_002_Shutdown_1s.xml
You should see entries regarding backend maintenance in the operation logs.
As for keeping your site alive, standard plans allow you to use the "Always On" feature, which pretty much does what Pingdom is doing to keep your website warm. Just enable it on the Configure tab of the portal.
Configure web apps in Azure App Service
https://azure.microsoft.com/en-us/documentation/articles/web-sites-configure/
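If you manage the app from the command line rather than the portal, the same switch can be flipped with the current Azure CLI, for example (app and resource group names are placeholders):

    az webapp config set --name my-web-app --resource-group my-rg --always-on true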
Every site on Azure runs two applications: one is yours and the other is the SCM endpoint (a.k.a. Kudu). These "shutdown" traces are for the Kudu app, not for your site.
If you want similar traces for your site, you'll have to implement them yourself, just like Kudu does. If you don't have Always On enabled, Kudu gets shut down after an hour of inactivity (as far as I remember).
Aside from that, as you mentioned, Azure will shut down your app during machine upgrades, though I don't think these shutdowns result in operation log events.
Are you seeing any side effects? Is this causing downtime?
When upgrades to the service are going on, your site might get moved to a different machine. We bring the site up on a new machine before shutting it down on the old one and letting connections drain; however, this should not result in any perceivable downtime.
Suppose I have the following situation. One of my Azure role instances happens to be started on a VM that runs on a faulty server, but Azure's monitoring doesn't see any problems. I somehow deduce this fact - for example, I see an "impossible" call stack, one that can't happen in my program under any normal conditions.
So I'd like Azure to move my instance to another VM and have the underlying hardware checked and repaired.
How can I do that except contacting support?
A few comments:
You can have this done, sort of, by calling support. The support team won't move your VM to a new server just because you ask, but they will work with you to determine whether the physical server really is bad and, if so, take it out of service.
RequestRecycle will only shut down the host process (i.e. WaIISHost) and related processes and then restart them. It won't reboot the VM, do a clean boot, or redeploy.
You can try a 'Reimage' from the portal or PowerShell if you suspect you might have a corrupt Windows installation. A reimage will recreate the Windows partition from scratch.
In order to force a new VM onto a new server you would have to do an in-place upgrade and modify the size of the VM (i.e. go from Small to Medium). This will cause new VMs to be created on new servers. You can then do another in-place upgrade to revert to the original size; a sketch of the relevant configuration change follows these comments.
That being said, I strongly agree with Brian's comment that it is very unlikely that bad hardware is causing an 'impossible' call stack. I would recommend opening a support incident so you can find the actual root cause instead of just fixing the most visible symptom.
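To illustrate the size-change trick mentioned above: for a classic cloud service the VM size lives in ServiceDefinition.csdef, so the in-place upgrade is essentially a deployment with the vmsize attribute changed. A minimal sketch, with placeholder service and role names:

    <ServiceDefinition name="MyCloudService" xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceDefinition">
      <!-- Changing vmsize (e.g. Small -> Medium) causes the instances to be created on new servers -->
      <WebRole name="MyWebRole" vmsize="Medium">
        <Sites>
          <Site name="Web">
            <Bindings>
              <Binding name="Endpoint1" endpointName="Endpoint1" />
            </Bindings>
          </Site>
        </Sites>
        <Endpoints>
          <InputEndpoint name="Endpoint1" protocol="http" port="80" />
        </Endpoints>
      </WebRole>
    </ServiceDefinition>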
I don't think you can move a VM. But you could create a new staging deployment, swap it into production, and then destroy the old one. You can't actually guarantee that the VMs are on different physical machines, but it seems reasonably likely. The larger the VMs are, the more likely that they're on separate servers.
That said, it seems really unlikely that your problems are due to a hardware fault rather than some subtle bug.
The scenario is simple: EF code-first migrations, multiple Azure website instances, a decent-sized DB of around 100 GB (assuming Azure SQL), and lots of active concurrent users, say 20k for the heck of it.
Goal: push out an update, with active users, and keep integrity while upgrading.
I've sifted through all the docs I can find, but the core details seem to be missing or I'm blatantly overlooking them. When Azure receives an update request via FTP/Git/TFS, how does it process the update? What does it do with active users? For example, does it freeze incoming requests to all instances, let requests already processing finish, upgrade/replace each instance, let the EF migrations run, and then let traffic start again? If it upgrades/refreshes all instances simultaneously, how does it ensure EF migrations run only once? If it refreshes instances live in a rolling upgrade (one at a time with no inbound traffic freeze), how can it ensure integrity, since instances still running the older version could break?
The main question, what is the real process after it receives the request to update? What are the recommendations for updating a live website?
To put it simply, it doesn't.
EF Migrations and Azure deployment are two very different beasts. Azure deployment gives you a number of options, including update and staging slots; you've probably seen
Deploy a web app in Azure App Service. For other readers, this is a good starting point.
In general, the Azure deployment model is concerned with the active connections to the IIS/website stack: an update ensures uninterrupted user access by taking the instance being deployed out of the load balancer pool and redirecting traffic to the other instances. It then cycles through the instances, updating them one by one.
This means that at any point in time, during an update deployment there will be multiple versions of your code running at the same time.
If your EF model has not changed between code versions, then Azure deployment works like a charm; users won't even know that it is happening. But if you need to apply a migration as part of the deployment, BEWARE.
In general, EF will only load the model if the code and DB versions match. It is very hard to use EF Migrations and support multiple code versions of the model at the same time.
EF Migrations are largely controlled by the Database Initializer.
See Upgrade the database using migrations for details.
As a developer you get to choose how and when the database will be upgraded, but know that if you are using Migrations together with deployment updates:
1. New code will not easily run against the old database schema.
2. If the old code/app restarts, many default initialization strategies will attempt to roll the schema back; if this happens, refer to point 1. ;)
3. If you get past the EF model check and run against the wrong version of the schema, you will experience exceptions and general failures when the code tries to use schema elements that are not there.
The simplest way to manage an EF migration on a live site is to take all instances of the site down for deployments that include an EF migration.
- You can use a maintenance page or a redirect, that's up to you.
If you are going to this much trouble, it is probably best to manually apply the DB update first; then if it fails you can easily abort the deployment, because it hasn't started yet!
Otherwise, deploy the update and the first instance to spin up will run the migration, if the initializer has been configured to do so...
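For completeness, here is a minimal sketch of what "configuring the initializer" might look like with EF 6 code-first migrations; MyDbContext and Configuration are placeholder names for your own context and migrations configuration classes:

    using System.Data.Entity;
    using System.Data.Entity.Migrations;

    // Placeholder context and migrations configuration; substitute your real classes.
    public class MyDbContext : DbContext { }

    public class Configuration : DbMigrationsConfiguration<MyDbContext>
    {
        public Configuration()
        {
            // Only apply explicit, code-generated migrations, never automatic ones.
            AutomaticMigrationsEnabled = false;
        }
    }

    public static class DatabaseConfig
    {
        // Call once at application start-up (e.g. from Application_Start in Global.asax).
        public static void RegisterInitializer()
        {
            Database.SetInitializer(
                new MigrateDatabaseToLatestVersion<MyDbContext, Configuration>());
        }
    }

With this wired up, the first instance that spins up and touches the context applies any pending migrations, which is exactly the behaviour described above, for better or worse.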
If you absolutely must have continuous deployment of both site code/content and model updates, then EF Migrations might not be the best tool to start with, as you will find it very restrictive OOTB for this scenario.
I was watching a "Fundamentals" course on Pluralsight and this was touched upon.
If you have 3 sites, Azure will take one offline, upgrade it, and then restart it when ready. At that point, the other 2 instances are taken offline and your upgraded instance starts, thus running your schema changes.
When those 2 come back, the EF migrations will already have been run, so your sites are back.
In theory it all sounds like it should work, although depending on how long the EF migrations take to run, requests may be delayed.
However, the comment from the author was that in this scenario (i.e. making schema changes) you should consider whether your website can run in this situation. The suggestion was that you either need to make your code work with both the old and new schemas, or show a "system down for maintenance" page.
The summary seems to be that what you are actually upgrading will affect your choices and method of deployment.
Generally speaking, if you want to support active upgrades you need to support multiple versions of your application simultaneously. This is really the only way to reliably stay active while you migrate/upgrade. Also consider feature switches to scale up your conversion in a controlled manner.
I have a classic ASP application that has been stable for years, and now we're having all kinds of problems with it. After moving the app between machines and wiping the original so we could have a fresh install of Windows, we've arrived at the following "symptom": the app pools do not appear to allow multiple simultaneous requests. Here's what we are seeing:
The app runs normally for most people, but when someone within one of the app pools accesses a long-running script (usually one with lots of DB access) all of the other users in the pool must wait for that script to complete. Once the script completes, everyone else's requests run. This initially made us suspect the DB connection string or something.
UNTIL... we noticed also that large file uploads into our system also cause the app pool to stop responding. What's interesting about this is that we're using the SAFileup COM+ object to do our uploads, which has a progress display in a pop-up window. When you go to upload the file, the progress display comes up, but then never refreshes to show upload progress. If you wait it out, however, the file will eventually upload and the other pending requests will process as normal.
Our app pools are in the default configuration, using the IWAM account to launch. I checked to ensure that the IWAM account has all the appropriate permissions. It does.
We've tried a variety of DB connection strings, none solved the problem (though I'm thinking it's not the DB connection string). Just in case someone thinks it is, here's our connection string: "Provider=SQLNCLI;Trusted_Connection=yes;Server=(local);Database=demo;". It couldn't be simpler. This string was previously not a problem.
I fussed with the web gardens setting and it does, indeed, make the system respond to multiple requests, but each worker process in the garden has its own session state, which causes our users to get booted when their request gets randomly assigned to a new worker process. Only having a single worker process in the garden was never an issue before anyway.
I've used SQL Profiler and sp_who2 to see if during the long-running scripts there are any deadlocks or blocks on the SQL Server. There are not.
The issues initially started after we had installed some patches from Microsoft. We wiped a machine clean and installed Win2k3 server, then SP2, and then didn't patch anymore after that. The problem remained, so it doesn't appear to have been a patch.
I'm pretty much at a loss now... does anyone have any experience with similar issues? If so, how were they fixed?
Check that you don't have ASP debugging enabled on the server. This will force the ASP script engine to run on a single thread.
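On IIS 6 you can check and clear that flag from the metabase with adsutil.vbs; something along these lines should do it (W3SVC/1/ROOT is the root application of the first website, so adjust the path to your app):

    cscript %SystemDrive%\Inetpub\AdminScripts\adsutil.vbs GET W3SVC/1/ROOT/AppAllowDebugging
    cscript %SystemDrive%\Inetpub\AdminScripts\adsutil.vbs SET W3SVC/1/ROOT/AppAllowDebugging FALSE
    cscript %SystemDrive%\Inetpub\AdminScripts\adsutil.vbs SET W3SVC/1/ROOT/AppAllowClientDebug FALSE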
Sounds like a limit on the number of concurrent incoming requests to IIS or the Windows server.
Check out http://blogs.msdn.com/b/david.wang/archive/2006/04/12/howto-maximize-the-number-of-concurrent-connections-to-iis6.aspx and http://forums.iis.net/p/1152112/1880908.aspx#1880908 on how to tweak the settings.
I haven't found a definitive list out there, but hopefully someone's got one going or we can come up with one ourselves. What causes disruptions for .NET applications, or general service disruption, running on IIS? For instance, web.config changes will cause a JIT recompilation (while just deploying a single page doesn't affect the whole app), and iisresets halt everything (natch, but you see where I'm going). How about things like creating a new virtual directory under a current web app?
It's helpful to know all the cases so you know if you can affect a change to a server without causing issues with the whole thing.
EDIT: I had IIS 6 in mind when I asked, but of course a list of anything different in other versions would be helpful as well to people.
It depends on what exactly you mean by disruptions. IISReset can cause a Service Unavailable message to be displayed for a short time as IIS is shut down and restarted.
Changes to web.config, or adding a .dll file to the bin directory of an application, cause a recycle of the application domain, but that is not a disruption exactly, more of a "delay" in responding; the user will NOT see an error, just a delayed response from the server. You can also get that from changing any files in App_Code, or .vb files on non-WAP developed sites.
You can also get IIS worker process shutdowns due to inactivity; the default setting is 20 minutes. Again, this is a delay, not a lack of service.
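If that idle shutdown is a problem, the timeout is configurable. On IIS 7 and later, for example, something like this appcmd call should change it (the app pool name is a placeholder, and 00:00:00 disables the idle timeout entirely):

    %windir%\system32\inetsrv\appcmd.exe set apppool "DefaultAppPool" /processModel.idleTimeout:"00:00:00"

On IIS 6 the equivalent setting is the idle timeout value on the app pool's Performance tab.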