SQL Azure Premium tier is unavailable for more than a minute at a time and we're around 10-20% utilization, if that

We run a web service that gets 6k+ requests per minute during peak hours and about 3k requests per minute during off hours. It serves lots of data feeds compiled from 3rd party web services, plus custom generated images. Our service and code are mature; we've been running this for years, and a lot of work by good developers has gone into the code base.
We're migrating to Azure, and we're seeing some serious problems. For one, we are seeing our Premium P1 SQL Azure database routinely become unavailable for 1-2 full minutes. I'm sorry, but this seems absurd. How are we supposed to run a web service with requests waiting 2 minutes for access to our database? This is occurring several times a day. It occurs less often since switching from the Standard level to the Premium level, but we're nowhere near our DB's DTU capacity and we're getting throttled hard far too often.
Our SQL Azure DB is Premium P1 and our load according to the new Azure portal is usually under 20%, with a couple of spikes each hour reaching 50-75%. Of course, we can't even trust Azure's portal metrics. The old portal gives us no data for our SQL database, and the new portal is very obviously wrong at times (our DB was not down for half an hour, as the graph suggests, but it was down for more than 2 full minutes):
Azure reports the size of our DB at a little over 12GB (in our own SQL Server installation, the DB is under 1GB; that's another of many questions: why is it reported as 12GB on Azure?). We've done plenty of tuning over the years and have good indexes.
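For what it's worth, a check like the following (a sketch against the standard file DMVs; the connection string is a placeholder) would show whether the 12GB figure is space allocated to the database files rather than space actually used by data:

```csharp
// Sketch: compare the space allocated to the database files with the space actually
// used by data. Azure's reported size can reflect allocation rather than actual data.
using System;
using System.Data.SqlClient;

class DbSizeCheck
{
    static void Main()
    {
        const string connectionString = "<your-azure-sql-connection-string>"; // placeholder

        const string sql = @"
            SELECT
                SUM(size) * 8.0 / 1024 AS allocated_mb,
                SUM(CAST(FILEPROPERTY(name, 'SpaceUsed') AS bigint)) * 8.0 / 1024 AS used_mb
            FROM sys.database_files
            WHERE type_desc = 'ROWS';";

        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(sql, conn))
        {
            conn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                if (reader.Read())
                {
                    Console.WriteLine("Allocated: {0} MB, actually used: {1} MB",
                        reader["allocated_mb"], reader["used_mb"]);
                }
            }
        }
    }
}
```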
Our service runs on two D4 cloud service instances. Our DB libraries all implement retry logic, waiting 2, 4, 8, 16, 32, and then 48 seconds before failing completely (roughly the pattern sketched below). Controllers are all async, and most of our external service calls are async. DB access is still largely synchronous, but our heaviest queries are async. We make heavy use of in-memory and Redis caching. The most frequent use of our DB is 1-3 records inserted per request (those tables are queried only once every 10 minutes to check error levels).
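The retry logic is roughly this shape (a simplified sketch of the pattern, not our actual data-access layer; only the delay schedule is taken from above):

```csharp
// Sketch of the retry pattern described above: retry transient SQL failures,
// waiting 2, 4, 8, 16, 32 and then 48 seconds before giving up entirely.
using System;
using System.Data.SqlClient;
using System.Threading.Tasks;

static class SqlRetry
{
    private static readonly int[] DelaysInSeconds = { 2, 4, 8, 16, 32, 48 };

    public static async Task<T> ExecuteAsync<T>(Func<Task<T>> operation)
    {
        for (var attempt = 0; ; attempt++)
        {
            try
            {
                return await operation();
            }
            catch (SqlException) when (attempt < DelaysInSeconds.Length)
            {
                // Back off, then try again; once the delays are exhausted,
                // the exception is no longer caught and bubbles up to the caller.
                await Task.Delay(TimeSpan.FromSeconds(DelaysInSeconds[attempt]));
            }
        }
    }
}
```

Callers wrap their DB calls, e.g. `var value = await SqlRetry.ExecuteAsync(() => command.ExecuteScalarAsync());`.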
Aside from batching up those request logging inserts (sketched below), there's really not much more give in our application's DB access code. We're nowhere near our DTU allocation on this database, and the server our DB is on still has around 2,000 DTUs available to be allocated. If we have to live with 1+ minute periods of unavailability every day, we're going to abandon Azure.
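If we do end up batching those logging inserts, the shape would be roughly this (a sketch, not our production code; table and column names are placeholders):

```csharp
// Sketch: buffer request-log rows in memory and flush them periodically in one bulk
// insert, so each web request no longer pays for its own round trip to the database.
using System;
using System.Collections.Concurrent;
using System.Data;
using System.Data.SqlClient;

class RequestLogBatcher
{
    private readonly ConcurrentQueue<Tuple<DateTime, string, int>> _buffer =
        new ConcurrentQueue<Tuple<DateTime, string, int>>();
    private readonly string _connectionString;

    public RequestLogBatcher(string connectionString)
    {
        _connectionString = connectionString;
    }

    // Called once per request instead of doing an immediate INSERT.
    public void Enqueue(string path, int statusCode)
    {
        _buffer.Enqueue(Tuple.Create(DateTime.UtcNow, path, statusCode));
    }

    // Called from a timer (e.g. every few seconds) or when the buffer grows large.
    public void Flush()
    {
        var table = new DataTable();
        table.Columns.Add("RequestTime", typeof(DateTime));
        table.Columns.Add("Path", typeof(string));
        table.Columns.Add("StatusCode", typeof(int));

        Tuple<DateTime, string, int> row;
        while (_buffer.TryDequeue(out row))
        {
            table.Rows.Add(row.Item1, row.Item2, row.Item3);
        }

        if (table.Rows.Count == 0) return;

        using (var bulk = new SqlBulkCopy(_connectionString))
        {
            bulk.DestinationTableName = "dbo.RequestLog"; // placeholder table name
            bulk.WriteToServer(table);
        }
    }
}
```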
Is this the best we get?
Querying stats in the database (something like the sketch below) seems to show we are nowhere near our resource limits. Also, on the Premium tier we should be guaranteed our DTU level second by second. But, again, we go more than a solid minute without being able to get a database connection. What is going on?
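The stats we're looking at come from something like this (a sketch; sys.dm_db_resource_stats keeps roughly the last hour of utilization in 15-second slices):

```csharp
// Sketch: read recent DTU-component utilization straight from the database.
using System;
using System.Data.SqlClient;

class ResourceStats
{
    static void Main()
    {
        const string connectionString = "<your-azure-sql-connection-string>"; // placeholder

        const string sql = @"
            SELECT TOP (20)
                end_time,
                avg_cpu_percent,
                avg_data_io_percent,
                avg_log_write_percent,
                avg_memory_usage_percent
            FROM sys.dm_db_resource_stats
            ORDER BY end_time DESC;";

        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(sql, conn))
        {
            conn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    Console.WriteLine(
                        $"{reader.GetDateTime(0):u}  cpu={reader["avg_cpu_percent"]}%  " +
                        $"data_io={reader["avg_data_io_percent"]}%  " +
                        $"log={reader["avg_log_write_percent"]}%  " +
                        $"mem={reader["avg_memory_usage_percent"]}%");
                }
            }
        }
    }
}
```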
I can also say that after we experience one of these longer delays, our stats seem to reset. The above image was captured a couple of minutes before a 1 min+ delay, and this one a couple of minutes after:

We have been in contact with Azure's technical staff and they confirm this is a bug in their platform that is causing our database to go through failover multiple times a day. They stated they will be deploying fixes starting this week and continuing over the next month.
Frankly, we're having trouble understanding how anyone can reliably run a web service on Azure. Our pool of Websites randomly goes down for a few minutes a few times a month, taking our public sites down. If our cloud service returns too many 500 responses, something in front of it cuts off all traffic and returns 502s (totally undocumented behavior as far as we can tell). SQL Azure has very limited performance and obviously isn't ready for prime time.

Related

Azure SQL Server Database serverless "Auto-pause": How fast is resuming?

In Azure, there is this "auto-pause" feature for serverless SQL Server Databases
Question:
Is there a quantitative measure (as opposed to qualitative) of how fast "Resume" is on a serverless SQL Database on Azure?
I'm asking this because in my experience with DTU based tiers (S0, S1, S2, etc.), changing from one tier to another takes around 2-3 minutes, and within that interval all queries fail with timeout errors.
I want to know if "resuming" offers a similar experience (I wouldn't like the query that triggers the resume to be erratic)
According to the documentation here - https://learn.microsoft.com/en-us/azure/azure-sql/database/serverless-tier-overview?view=azuresql - the latency to auto-resume and auto-pause a serverless database is generally on the order of 1 minute to auto-resume and 1-10 minutes after the expiration of the delay period to auto-pause.
From personal experience, a 1GB DB takes between 1-2 minutes to resume. During this time any queries, including the one that prompted the resume, will time out.
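If you want a number for your own database rather than the documented range, a crude probe like this (a sketch; the connection string is a placeholder) times how long it takes to get the first successful round trip after the database has paused:

```csharp
// Sketch: after the database has auto-paused, measure how long it takes until the
// first connection attempt actually succeeds (i.e., how long "resume" takes).
using System;
using System.Data.SqlClient;
using System.Diagnostics;
using System.Threading;

class ResumeProbe
{
    static void Main()
    {
        const string connectionString =
            "Server=tcp:<server>.database.windows.net;Database=<db>;" +
            "User ID=<user>;Password=<pwd>;Connect Timeout=60;"; // placeholder

        var stopwatch = Stopwatch.StartNew();
        while (true)
        {
            try
            {
                using (var conn = new SqlConnection(connectionString))
                using (var cmd = new SqlCommand("SELECT 1;", conn))
                {
                    conn.Open();
                    cmd.ExecuteScalar();
                }
                break; // first successful round trip: the database has resumed
            }
            catch (SqlException ex)
            {
                // While resuming, attempts may fail or time out; log and retry shortly.
                Console.WriteLine(
                    $"Not ready yet (error {ex.Number}), elapsed {stopwatch.Elapsed.TotalSeconds:F0}s");
                Thread.Sleep(TimeSpan.FromSeconds(5));
            }
        }
        Console.WriteLine($"Resumed after {stopwatch.Elapsed.TotalSeconds:F0} seconds");
    }
}
```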

Create database within Azure SQL elastic pool takes almost 10 minutes to complete

We use Azure SQL databases and an elastic pool (level "Standard").
Usually the creation of a new customer database takes approximately 1-2 minutes, but suddenly it started taking much longer (up to 10 minutes) and I have no idea why this is happening. I checked the pool in the Azure portal and everything seems fine. We are still far away from reaching the given limits (257/500 databases; ~11GB/200GB data size). Upscaling for a short period of time has no effect.
Is there anything else I can do?
I think there are some ongoing issues with Microsoft cloud services; check whether your issue is related to that. If it is, your issue should be temporary.
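Beyond checking the status history, you could also watch the create operation itself from the logical server's master database; a rough sketch (the database name and connection string are placeholders):

```csharp
// Sketch: poll sys.dm_operation_status in the master database to see whether the
// CREATE DATABASE operation is still in progress and what progress it reports.
using System;
using System.Data.SqlClient;

class CreateDbProgress
{
    static void Main()
    {
        // Connect to the logical server's master database, not the new database itself.
        const string masterConnectionString =
            "Server=tcp:<server>.database.windows.net;Database=master;" +
            "User ID=<user>;Password=<pwd>;"; // placeholder

        const string sql = @"
            SELECT operation, state_desc, percent_complete, start_time, last_modify_time
            FROM sys.dm_operation_status
            WHERE major_resource_id = @dbName
            ORDER BY start_time DESC;";

        using (var conn = new SqlConnection(masterConnectionString))
        using (var cmd = new SqlCommand(sql, conn))
        {
            cmd.Parameters.AddWithValue("@dbName", "NewCustomerDb"); // placeholder name
            conn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    Console.WriteLine(
                        $"{reader["operation"]} -> {reader["state_desc"]} " +
                        $"({reader["percent_complete"]}%), started {reader["start_time"]}, " +
                        $"last update {reader["last_modify_time"]}");
                }
            }
        }
    }
}
```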

Azure slow response times

Wondering if any Azure experts out there can give me some suggestions: we have an App Service app running and have noticed that the first few requests (even with Always On enabled) can take a very long time to get a response.
The chart below is what we observed: one can see that it takes up to 2 minutes initially, and afterwards we get more reasonable response times of a few milliseconds to seconds.
How can we make sure that it ALWAYS responds quickly? As a simple test, it is not doing anything processing-intensive, just a few simple DB queries to check if a key exists.
At the beginning (the very first few minutes) Azure SQL databases run queries slowly due to reduced memory allocation. If you look at the plan of a query that runs slowly at first and then shows good performance, you can see the query plan is the same. On the first runs you may see query waits such as MEMORY_ALLOCATION_EXT, IO_QUEUE_LIMIT or PAGEIOLATCH_SH.
After periods of no activity, failovers, or scaling tiers up/down, the memory allocation may be reduced and queries may show poor performance for the first few minutes.
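You can check this yourself right after one of the slow windows; a sketch that reads the database-scoped wait statistics for the wait types mentioned above (connection string is a placeholder):

```csharp
// Sketch: inspect database-scoped wait statistics for the wait types that tend to
// appear while memory has to be re-warmed after idle periods, failovers or scaling.
using System;
using System.Data.SqlClient;

class WaitStatsCheck
{
    static void Main()
    {
        const string connectionString = "<your-azure-sql-connection-string>"; // placeholder

        const string sql = @"
            SELECT wait_type, waiting_tasks_count, wait_time_ms, max_wait_time_ms
            FROM sys.dm_db_wait_stats
            WHERE wait_type IN ('MEMORY_ALLOCATION_EXT', 'IO_QUEUE_LIMIT', 'PAGEIOLATCH_SH')
            ORDER BY wait_time_ms DESC;";

        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(sql, conn))
        {
            conn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    Console.WriteLine(
                        $"{reader["wait_type"]}: {reader["waiting_tasks_count"]} waits, " +
                        $"{reader["wait_time_ms"]} ms total, {reader["max_wait_time_ms"]} ms max");
                }
            }
        }
    }
}
```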
Hope this helps

Occasional delays in response from azure cloud service

I maintain an azure cloud service. It is set to auto-scale based on load. To monitor the health of this service I have another service which pings this service every 2 minutes. The usual response time from this service is around 100ms.
Once or twice a week I see that the service does not respond. It is not really a worry for me because it happens quite infrequently, but I am still trying to figure out what could be causing the service not to respond. I do not think the problem is with the pinging service; I don't see any issues with the other services it pings (not on Azure, but on other servers).
What could be causing these occasional delays? Are any other Azure service owners seeing such delays?
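For reference, the monitor does essentially something like this (a simplified sketch of the idea, not the actual code; the URL is a placeholder):

```csharp
// Sketch of the health-check pinger described above: hit the service every 2 minutes,
// time the response, and record anything that is slow or fails outright.
using System;
using System.Diagnostics;
using System.Net.Http;
using System.Threading.Tasks;

class HealthPinger
{
    static async Task Main()
    {
        const string healthUrl = "https://<your-cloud-service>/health"; // placeholder

        using (var client = new HttpClient { Timeout = TimeSpan.FromSeconds(30) })
        {
            while (true)
            {
                var timer = Stopwatch.StartNew();
                try
                {
                    var response = await client.GetAsync(healthUrl);
                    timer.Stop();
                    Console.WriteLine(
                        $"{DateTime.UtcNow:u} {(int)response.StatusCode} in {timer.ElapsedMilliseconds} ms");
                }
                catch (Exception ex) // a timeout or connection failure counts as "no response"
                {
                    Console.WriteLine($"{DateTime.UtcNow:u} no response: {ex.Message}");
                }
                await Task.Delay(TimeSpan.FromMinutes(2));
            }
        }
    }
}
```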
I'm having quite similar problems, but I use Application Insights, so I have some statistics. For example, response time increases together with SQL Azure access time and CPU usage. My average response time according to Application Insights is about 600 ms and average RPS is about 0.6. During these problems RPS is usually higher than average (up to 1.5), but average response time grows to as much as 1 minute! (During the day my RPS can grow to 3 or even higher without any growth in response time.) Since I have a 1-minute SQL connection timeout and I see a dramatic growth in total SQL Azure access time during these periods, I can assume the problem is caused by SQL Azure. This also happens about once every day or two, for about 10-15 minutes at most, and my ping service also always reports that the service doesn't respond.
So my advice here is to install Application Insights to analyze what happens during these response delays. It would be great if you shared your results here.
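If the built-in dependency collection doesn't capture your SQL calls, you can also report them by hand; a sketch using the Application Insights SDK (assuming the Microsoft.ApplicationInsights NuGet package is installed and configured; the dependency name is a placeholder):

```csharp
// Sketch: manually report a SQL call to Application Insights as a dependency, so slow
// database periods line up with the slow responses the ping service is reporting.
using System;
using System.Data.SqlClient;
using System.Diagnostics;
using Microsoft.ApplicationInsights;
using Microsoft.ApplicationInsights.DataContracts;

class TrackedSql
{
    // Assumes Application Insights is already configured for the process.
    private static readonly TelemetryClient Telemetry = new TelemetryClient();

    public static object ExecuteScalarTracked(string connectionString, string sql)
    {
        var dependency = new DependencyTelemetry
        {
            Type = "SQL",
            Name = "RequestLogQuery", // placeholder label
            Data = sql,
            Timestamp = DateTimeOffset.UtcNow
        };
        var timer = Stopwatch.StartNew();
        try
        {
            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand(sql, conn))
            {
                conn.Open();
                var result = cmd.ExecuteScalar();
                dependency.Success = true;
                return result;
            }
        }
        catch
        {
            dependency.Success = false;
            throw;
        }
        finally
        {
            dependency.Duration = timer.Elapsed;
            Telemetry.TrackDependency(dependency); // shows up under "Dependencies"
        }
    }
}
```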
P.S. I also use autoscale based on load, though it doesn't really help in these particular situations.

Minimize downtime (Azure Website + SQL DB)

My Azure Websites are down a lot: four outages (30 minutes to 3 hours) in the past 30 days. I only use one small Standard website and one Web edition SQL DB in US West, so I can't expect 99.5%. This week (a few days ago and currently), 503 errors were/are the problem, but I have also experienced substantial DB downtime on other occasions.
My question is: what can I do (hopefully without too much additional cost and effort) to improve stability? What measures have other Azure users tried? Would any of this have prevented the 3-hour downtime last Monday?
There are (at least) three things you can do:
Scale your website so that you have 2 instances running (How to Scale Websites);
Deploy another copy of the website in a different region, and use Traffic Manager to load balance them (Traffic Manager Overview)
Create another instance of the database, and sync them (Getting Started with Azure SQL Data Sync)
I'd like to point out that with Websites, you get 99.9% availability for websites running in Basic and Standard mode, regardless of the number of instances deployed. You can view the SLA document here: http://go.microsoft.com/fwlink/?linkid=301329&clcid=0x409
Regarding the 503 errors you experienced, this was an intermittent issue: http://azure.microsoft.com/en-us/status/#history
