Minimize downtime (Azure Website + SQL DB) - azure

My Azure websites are down a lot, four outages (30 minutes - 3 hours) in the past 30 days. I only use one small standard website and one web SQL DB in US West so I can't expect 99.5%. This week (a few days ago and currently), 503 errors were / are the problem, but I also experienced substantial DB downtimes at other occasions.
My question is: what can I do (with hopefully not too much additional costs and effort) to raise stability? Which measures did other Azure users try? Would this have prevented the 3 hour downtime last Monday?

There are (at least) three things you can do:
Scale your website so that you have 2 instances running (How to Scale Websites);
Deploy a another copy of the website in a different region, and use traffic manager to Load Balance them (Traffic Manager Overview)
Create another instance of the database, and sync them (Getting Started with Azure SQL Data Sync)

I'd like to point out that with Websites, you get a 99.9% availability for websites running is Basic and Standard mode regardless of the number of instances deployed. You can view the SLA document here: http://go.microsoft.com/fwlink/?linkid=301329&clcid=0x409
Regarding the 503 errors you experiences, this was an intermittent issue http://azure.microsoft.com/en-us/status/#history

Related

Fluctuating response times for Azure website

I have a .Net web app hosted on Azure and am on the S1 Production pricing tier (1x core, 1.75 GB memory, A-Series compute). What's weird is I am going through extended periods of poor performance. Usually my average response time is around the 1.4 s mark. Not good by any stretch but it's something I can work with. However I'm experiencing extended periods where the response time shoots up to around the 5 s or greater mark. These periods last for days, up to a week, before coming back to normal levels. My knowledge of Azure is pretty limited but I can't seem to find anything that would explain this.
average response time over the last 30 days
You might first want to identify if this is an issue with your web app itself or is this the trend of app usage(i.e. it receives max hits during specific weeks of a month).
There are several areas you might want to look for further diagnosis. A few are -
Look for the number of requests during the time the website is slow. This is a web part on the website overview page.
Check the diagnose and solve problems. It is self-service diagnostic and troubleshooting experiencing to help you resolve issues with your web app.
If you have a considerable user base(number of users) and its a production environment sometimes a 1 Core + 1.75 GB RAM might not be sufficient to bear the load. If you determine that this is due to a usage trend from your users then you can plan for scaling out/up your application to meet the demands of high usage.

SQL Azure Premium tier is unavailable for more than a minute at a time and we're around 10-20% utilization, if that

We run a web service that gets 6k+ requests per minute during peak hours and about 3k requests per minute during off hours. Lots of data feeds compiled from 3rd party web services and custom generated images. Our service and code is mature, we've been running this for years. A lot of work by good developers has gone into our service's code base.
We're migrating to Azure, and we're seeing some serious problems. For one, we are seeing our Premium P1 SQL Azure database routinely become unavailable for 1-2 full entire minutes. I'm sorry, but this seems absurd. How are we supposed to run a web service with requests waiting 2 minutes for access to our database? This is occurring several times a day. It occurs less after switching from Standard level to Premium level, but we're nowhere near our DB's DTU capacity and we're getting throttled hard far too often.
Our SQL Azure DB is Premium P1 and our load according to the new Azure portal is usually under 20% with a couple spikes each hour reaching 50-75%. Of course, we can't even trust Azure's portal metrics. The old portal gives us no data for our SQL, and the new portal is very obviously wrong at times (our DB was not down for 1/2 an hour, like the graph suggests, but it was down for more than 2 full minutes):
Azure reports the size of our DB at a little over 12GB (in our own SQL Server installation, the DB is under 1GB - that's another of many questions, why is it reported as 12GB on Azure?). We've done plenty of tuning over the years and have good indices.
Our service runs on two D4 cloud service instances. Our DB libraries are all implementing retry logic, waiting 2, 4, 8, 16, 32, and then 48 seconds before failing completely. Controllers are all async, most of our various external service calls are async. DB access is still largely synchronous but our heaviest queries are async. We heavily utilize in-memory and Redis caching. The most frequent use of our DB is 1-3 records inserted for each request (those tables are queried only once every 10 minutes to check error levels).
Aside from batching up those request logging inserts, there's really not much more give in our application's db access code. We're nowhere near our DTU allocation on this database, and the server our DB is on has like 2000 DTU's available to be allocated still. If we have to live with 1+ minute periods of unavailability every day, we're going to abandon Azure.
Is this the best we get?
Querying stats in the database seems to show we are nowhere near our resource limits. Also, on premium tier we should be guaranteed our DTU level second-by-second. But, again, we go more than an entire solid minute without being able to get a database connection. What is going on?
I can also say that after we experience one of these longer delays, our stats seem to reset. The above image was a couple minutes before a 1 min+ delay and this is a couple minutes after:
We have been in contact with Azure's technical staff and they confirm this is a bug in their platform that is causing our database to go through failover multiple times a day. They stated they will be deploying fixes starting this week and continuing over the next month.
Frankly, we're having trouble understanding how anyone can reliably run a web service on Azure. Our pool of Websites randomly goes down for a few minutes a few times a month, taking our public sites down. If our cloud service returns too many 500 responses something in front of it is cutting off all traffic and returning 502's (totally undocumented behavior as far as we can tell). SQL Azure has very limited performance and obviously isn't ready for prime time.

Occasional delays in response from azure cloud service

I maintain an azure cloud service. It is set to auto-scale based on load. To monitor the health of this service I have another service which pings this service every 2 minutes. The usual response time from this service is around 100ms.
Once or twice a week I see that the service does not respond. It is not really a worry for me - because it happens quite infrequently. I still am trying to figure out what could be causing the service to not respond. I do not think the problem is with the pinging service - I don't see any of the other services (not on azure, but on other servers) that it pings having any issues.
What could be causing these occasional delays. Any other azure service owners seeing such delays ?
Having quite similar problems. But I use Applications Inside, so I have some statistics. For example that reponse time increases together with SQL azure access time and CPU usage. My average response time according to Applications Inside is about 600ms and average RPS is about 0,6. During these problems RPS usually higher than avarage - up to 1.5, but average response time grows up to 1min! (During the day my RPS can grow up to 3 or even higher without any reponse time growth). As I have 1min sql connection timeout and I have drammatical growth of total SQL azure access time during this periods I can assume that problem happens bacause of SQL Azure. This also happens once a day or two, for about 10-15 minutes max and my ping service also always reports that service doesn't respond.
So my advice here - install Application Insights to analyze what happens dusring these response delays. It would be great if you share your results here.
P.S. I also use autoscale based on load. Though it doesn't really help in these concrete situations.

Windows Azure reliability (my server just lost its drives & sites, then 20 minutes later they reappeared)

I am on a Windows Azure trial to evaluate migrating a number of commercial ASP.NET sites to Azure from dedicated hosting. All was going OK ... until just now!
Some background - the sites are set up under Web Roles (i.e. as opposed to Web Sites) using SQL Azure and SQL Reporting. The site content was under the X: drive (there was also a B: drive that seemed to be mapped to the same location). There are several days left of the trial.
Without any apparent warning my test sites suddenly stopped working. Examining the server (through RDP) I saw that the B: and X: drives had disappeared (just C: D & E: I think were left), and in IIS the application pools and Sites had disappeared. In the Portal however, nothing seemed to have changed - the same services & config seemed to be there.
Then about 20 minutes later the missing drives, app pools and sites reappeared and my test sites started working again! However, the B: drive was gone and now there was an F: drive (showing the same as X:); also the MS ReportViewer 2008 control that I had installed earlier in the day was gone. It is almost as if the server had been replaced with another (but the IIS config was restored from the original).
As you can imagine, this makes me worried! If this is something that could happen in production there is no way I would consider hosting commercial sites for clients on Azure (unless there is some redundancy system available to keep a site up when such a failure occurs).
Can anyone explain what may have happened, if this is possible/predictable under a live subscription, and if so how to work around it?
One other thing to keep in mind is that an Azure Web Role is not persistent. I'm not sure how you installed the MS Report Viewer 2008 control but anything you add or install outside of a deployment package when you push your solution to Azure is not guaranteed to be available at some future point.
I admit that I don't fully understand the full picture when it comes to the overall architecture of Azure but I do know that Web Roles can and do re-create themselves from time to time. When the role recycles, it returns to the state as it was when it was installed. This is why Microsoft suggests using at least 2 instances of your role because while one or the other may recycle they will never recycle both at the same time, part of what guarantees the 99.9% uptime.
You might also want to consider an Azure VM. They are persistent but require you to maintain the server in terms of updates and software much in the way I suspect you are already doing with your dedicated hosting.
I've been hosting my solution in a large (4 core) web role, also using SQL Azure, for about two years and have had great success with it. I have roughly 3,000 users and rarely see the utilization of my web role go over 2% (meaning I've got a lot of room to grow). Overall it is a great hosting solution in my opinion.
According to the Azure SLA Microsoft guarantees up time of 99.9% or higher on all its products per billing month. (20 min on the month would be .0004% loss, not being critical, just suggesting that they are still within their SLA)
Current status shows that sql databases were having issues in the US north last night, but all services appear to be up currently
Personally, I have seen the dashboard go down, and report very weird problems, but the services that I programmed to worked just fine all the way through it. When I experienced this problem it was reported on the Azure Status, the platform status and the twitter feed
While I have seen bumps, they are few and far between, and I find reliability to be perceptibly higher than other providers that I have worked with.
As for workarounds I would suggest a standard mode for your websites and increasing instances of the site. You might try looking into the new add ins that are available with the latest Azure release. Active Cloud Monitoring by Metrichub might be what you require.
It sounds like you're expecting the web role to act as a Virtual Machine instance.
Web Roles aren't persistent (the machine can be destroyed and recreated at any time), so you should do any additional required set up as a 'startup task' in your Azure project (never install software manually).
Because of this you need at least 2 instances so that rolling upgrades (i.e. Windows security patches, hotfixes and so on) can be performed automatically without having your entire deployment taken offline.
If this doesn't suit your use case then you should look at Azure Virtual Machines, but you'll need to manage updates and so on yourself. It's usually better to use Web Roles properly as you can then do scaling and so on a lot more easily.

Windows Azure, MSDN offer, 750 small compute hours

I'm an msdn subscriber and I'm looking at Azure as a possible platform for a new website that will test the water of a new service. This website is expecting low to very low traffic at the time of launch. I've heard that this kind of traffic levels is very expensive for Azure but since they have this msdn offer, I thought I'd finally take a look at Azure.
In the offer, I'm looking at getting "750 small compute hours per month". From the reading I've done, this seems that, if I purchase nothing more than what's given (although the subscription itself is thousands of dollars of course), that an entire month would be covered. Since 24 (hours) x 31 (max days in a month) = 744 I'm still below my allotted 750 for the month.
Am I missing something else from this simple equation? Is there further aspects that could cause the site to be "turned off" temporarily that should be considered?
Yes, you can indeed run a small instance during a whole month. Or you can have 2 extra-small instances instead (having 2 instances means you're covered by the SLA).
There are 2 other things you need to consider:
Depending on your subscription you can have maximum 45GB of storage (blob/table/queue). If you use Virtual Machines you need to know that the system disk (and additional data disks) are persisted as blobs, so make sure not to reach the limit here.
There are also other limits active, but the most important one besides storage is the data transfer limit which is also very limited (max 35GB out).
If you're expecting very low traffic, did you ever consider Windows Azure Web Sites? You get 10 of those for free during 12 months. The free ones run on shared instances, but they are perfect to host the first low-traffic version of your app.

Resources