Changing service tiers or performance levels and database downtime in Azure SQL

I have identified that we may need to scale into the next service tier at some point soon (Standard to Premium).
For others interested, this article provides great guidelines for analysing your SQL Database.
My question: Is there any downtime while scaling to a different service tier or performance level?

It depends on your definition of "downtime". I have changed performance levels many times. Going from Standard to Premium we experienced many errors. Here are a few samples:
System.Data.SqlClient.SqlException (0x80131904): A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 0 - An existing connection was forcibly closed by the remote host.) ---> System.ComponentModel.Win32Exception (0x80004005): An existing connection was forcibly closed by the remote host.

System.Data.SqlClient.SqlException (0x80131904): The ALTER DATABASE command is in process. Please wait at least five minutes before logging into database '...', in order for the command to complete. Some system catalogs may be out of date until the command completes. If you have altered the database name, use the NEW database name for future activity.

System.Data.SqlClient.SqlException (0x80131904): The service has encountered an error processing your request. Please try again. Error code 40174. A severe error occurred on the current command. The results, if any, should be discarded.

System.Data.DataException: Unable to commit the transaction. The underlying connection is not open or has not been initialized.
My advice is to change performance levels off hours or during maintenance periods if possible.
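Errors like the ones above are transient: they clear once the switch-over completes, so a client-side retry with exponential backoff usually rides them out. Here is a minimal sketch in Python; the `is_transient` predicate, attempt count, and delays are illustrative assumptions, not an Azure-specific API:

```python
import time

def retry_transient(operation, is_transient, max_attempts=5,
                    base_delay=1.0, sleep=time.sleep):
    """Run `operation`; while `is_transient(exc)` says a failure is
    retryable, wait base_delay * 2**attempt seconds and try again.
    Re-raises after max_attempts or on a non-transient error."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception as exc:
            if attempt == max_attempts - 1 or not is_transient(exc):
                raise
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

In a real client, `is_transient` would inspect the SqlException's error number (for example, 40174 from the sample above) rather than the exception type; the `sleep` parameter is injected so the behavior can be tested without real delays.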

There is no downtime when changing tiers; I have done it a few times. The change is not immediate, though: it can take at least five minutes, but during that time the database operates as normal.

As above, it depends on your definition of downtime. There is a brief period as the tier switches when transactions may be rolled back.
From 'Scaling up or scaling down...' section of this page: https://learn.microsoft.com/en-us/azure/sql-database/sql-database-service-tiers
Note that changing the service tier and/or performance level of a database creates a replica of the original database at the new performance level, and then switches connections over to the replica. No data is lost during this process but during the brief moment when we switch over to the replica, connections to the database are disabled, so some transactions in flight may be rolled back. This window varies, but is on average under 4 seconds, and in more than 99% of cases is less than 30 seconds. Very infrequently, especially if there are large numbers of transactions in flight at the moment connections are disabled, this window may be longer.
Since "in-flight transaction" usually refers to a transaction that is running when a connection is broken, it seems that either connections may be broken mid-transaction, or transactions operating across multiple connections might fail and be rolled back if one of the connections is denied during the switch. If the latter, then simple transactions may not often be affected during the switch. If the former, then busy databases will almost certainly see some impact.

There is no downtime when changing TIERS but there IS downtime when changing billing models. You literally have to backup your databases, spin up new databases in the new billing model servers, and restore them. You then have to change all your database references in apps or websites. If you want to change tiers FROM a billing tier that is no longer supported you WILL need to migrate to the new billing model first. We learned this the hard way. Microsoft doesn't make it easy either - it's not a pushbutton operation.

Related

Limit Azure Function restart rate

I have already run into a similar problem a few times:
An Azure Function with a ServiceBusTrigger fails for some reason (misconfiguration, infrastructure issues, it doesn't really matter) to connect to Service Bus (so it happens at the trigger level), and that leads to two issues:
It tries to restart constantly, increasing CPU consumption
It generates literally millions of exceptions in AppInsights, which leads to quota exceedance
Practically every configuration error therefore means significantly increased bills and requires thorough monitoring after every deployment, which is an annoying and error-prone way to operate.
So, my question: Is there a way to set a delay between restart attempts of, for example, one second? And, in addition, is there a way to limit the number of restart attempts and then shut down the Function?
Establishing a connection to the broker to fetch messages is the responsibility of the Functions runtime's Scale Controller. That aspect is entirely abstracted from customers and is not configurable. I suggest raising an issue with the Azure Functions team, likely in the Runtime repo.

SQL Azure Premium tier is unavailable for more than a minute at a time and we're around 10-20% utilization, if that

We run a web service that gets 6k+ requests per minute during peak hours and about 3k requests per minute during off hours. Lots of data feeds compiled from 3rd party web services and custom generated images. Our service and code is mature, we've been running this for years. A lot of work by good developers has gone into our service's code base.
We're migrating to Azure, and we're seeing some serious problems. For one, we are seeing our Premium P1 SQL Azure database routinely become unavailable for 1-2 full minutes. I'm sorry, but this seems absurd. How are we supposed to run a web service with requests waiting 2 minutes for access to our database? This is occurring several times a day. It occurs less often since switching from the Standard level to the Premium level, but we're nowhere near our DB's DTU capacity and we're still getting throttled hard far too often.
Our SQL Azure DB is Premium P1 and our load according to the new Azure portal is usually under 20% with a couple spikes each hour reaching 50-75%. Of course, we can't even trust Azure's portal metrics. The old portal gives us no data for our SQL, and the new portal is very obviously wrong at times (our DB was not down for 1/2 an hour, like the graph suggests, but it was down for more than 2 full minutes):
Azure reports the size of our DB at a little over 12GB (in our own SQL Server installation, the DB is under 1GB - that's another of many questions, why is it reported as 12GB on Azure?). We've done plenty of tuning over the years and have good indices.
Our service runs on two D4 cloud service instances. Our DB libraries are all implementing retry logic, waiting 2, 4, 8, 16, 32, and then 48 seconds before failing completely. Controllers are all async, most of our various external service calls are async. DB access is still largely synchronous but our heaviest queries are async. We heavily utilize in-memory and Redis caching. The most frequent use of our DB is 1-3 records inserted for each request (those tables are queried only once every 10 minutes to check error levels).
Aside from batching up those request logging inserts, there's really not much more give in our application's db access code. We're nowhere near our DTU allocation on this database, and the server our DB is on has like 2000 DTU's available to be allocated still. If we have to live with 1+ minute periods of unavailability every day, we're going to abandon Azure.
Is this the best we get?
Querying stats in the database seems to show we are nowhere near our resource limits. Also, on premium tier we should be guaranteed our DTU level second-by-second. But, again, we go more than an entire solid minute without being able to get a database connection. What is going on?
I can also say that after we experience one of these longer delays, our stats seem to reset. The above image was a couple minutes before a 1 min+ delay and this is a couple minutes after:
We have been in contact with Azure's technical staff and they confirm this is a bug in their platform that is causing our database to go through failover multiple times a day. They stated they will be deploying fixes starting this week and continuing over the next month.
Frankly, we're having trouble understanding how anyone can reliably run a web service on Azure. Our pool of Websites randomly goes down for a few minutes a few times a month, taking our public sites down. If our cloud service returns too many 500 responses something in front of it is cutting off all traffic and returning 502's (totally undocumented behavior as far as we can tell). SQL Azure has very limited performance and obviously isn't ready for prime time.

Massive test against azure getting connection refused or service unavailable

We have a cloud service that gets requests from users, passes the data (two params) to table entities, and puts them into cloud tables (using BatchTableOperations to InsertOrReplace rows). The method is that simple, trying to keep it light and fast (partition key and partitionKey/rowKey pair issues are controlled).
We need the cloud service to cope with about 10k to 15k "concurrent" requests. We first used queues to get user data and a Worker Role to process the queue messages and put them into SQL. Although no errors arose and no data was lost, processing was too slow for our needs. Now we are trying cloud tables to see if we can process data faster. With smaller numbers of requests, processing is fast, but as we get more requests, errors occur and data is lost.
I've set up a few virtual machines for testing in the same virtual network that the cloud service is on, to prevent firewall to stop requests. A jMeter test with 1000 threads and 5 loops, gets 0% error. Same test from 2 virtual machines is ok too. Adding a third machine causes first errors (0.14% requests get Service unavailable 503 errors). Massive tests from 10 machines, 1000 threads and 2 loops gets massive 503 and/or connection refused errors. We have tried scaling cloud service up to 10 instances but that makes little difference on results.
I'm a bit stuck with this issue, and don't know if I'm approaching the problem with the right tools. Any suggestion will be highly welcome.
The issue may be related to throttling at the storage level. Please look at the scalability targets specified by the Windows Azure Storage team here: http://blogs.msdn.com/b/windowsazurestorage/archive/2012/11/04/windows-azure-s-flat-network-storage-and-2012-scalability-targets.aspx. You may want to try running the load test while taking these scalability targets into consideration.
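One way to stay under published scalability targets is to cap the client-side request rate before the service has to throttle you. Here is a minimal token-bucket sketch; the rate and capacity numbers are placeholders for whatever the current targets allow, and the clock is passed in explicitly so the logic is deterministic:

```python
class TokenBucket:
    """Allow at most `rate` operations per second, with bursts up to
    `capacity`. Callers skip or queue work when try_acquire is False."""
    def __init__(self, rate, capacity, now):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = now

    def try_acquire(self, now):
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

In the scenario above, each JMeter-driven client (or the cloud service itself, per storage partition) would hold a bucket sized to its share of the target throughput, turning hard 503s into graceful local waits.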

Detect if SQL Azure is throttling

I have an Azure worker role that inserts a batch of records into a table. Yesterday, it took at most 5 minutes to insert the records, but today it has been taking up to a couple of hours. I suspect that the process is being throttled, but I don't get any exceptions. Does SQL Azure always return an error if you are being throttled, or is there another way to detect if you are being throttled?
In the case of CPU throttling, SQL Database will not throw an error but will slow down the operation. At this time there is no mechanism to determine whether this form of throttling is taking place, other than possibly looking at the query stats to see that the work is proceeding slowly (if your CPU time is lower than usual). Check this link for details about this behavior: the performance and elasticity guide (look for "Performance Thresholds Monitored by Engine Throttling").
One of the newer capabilities is the ability to monitor the number of outstanding requests a SQL Azure database has. You can do this with this query:
select count(*) from sys.dm_exec_requests
As you will see in this documentation, reaching the limit of worker threads is a key reason for being throttled. It is also documented there that as you approach 180 worker threads you can expect to be throttled.
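Building on that query, the raw count can be turned into a simple throttling-risk signal for a monitoring job. A sketch, assuming the ~180 worker-thread ceiling mentioned above; the warning fraction is an arbitrary choice, not a documented limit:

```python
def worker_pressure(active_requests, worker_limit=180, warn_fraction=0.8):
    """Interpret the count from `select count(*) from sys.dm_exec_requests`.
    Returns (utilization, at_risk): utilization as a fraction of the
    worker-thread limit, and whether it has crossed the warning line."""
    utilization = active_requests / worker_limit
    return utilization, utilization >= warn_fraction
```

A monitoring loop would run the DMV query every few seconds, feed the count through this function, and alert before the limit is actually hit rather than after throttling starts.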
This is one of the things used in the Cotega monitoring service for SQL Azure to detect issues. [Disclaimer: I work on this service]

IIS Connection Pool interrogation/leak tracking

Per this helpful article I have confirmed I have a connection pool leak in some application on my IIS 6 server running W2k3.
The tough part is that I'm serving 300 websites written by 700 developers from this server in 6 application pools, 50% of which are .NET 1.1 which doesn't even show connections in the CLR Data performance counter. I could watch connections grow on my end if everything were .NET 2.0+, but I'm even out of luck on that slim monitoring tool.
My 300 websites connect to probably 100+ databases spread out between Oracle, SQLServer and outliers, so I cannot watch the connections from the database end either.
Right now my best and only plan is to do a loose binary search for my worst offenders. I will kill application pools and slowly remove applications from them until I find which individual applications result in the most connections dropping when I kill their pool. But since this is a production box and I like continued employment, this could take weeks as a tracing method.
Does anyone know of a way to interrogate the IIS connection pools to learn their origin or owner? Is there an MSMQ trigger to which I might be able to attach when they are created? Anything silly I'm overlooking?
Kevin
(I'll include the error code to facilitate others finding your answers through search:
Exception: System.InvalidOperationException
Message: Timeout expired. The timeout period elapsed prior to obtaining a connection from the pool. This may have occurred because all pooled connections were in use and max pool size was reached.)
Try starting with this first article from Bill Vaughn.
Todd Denlinger wrote a fantastic class http://www.codeproject.com/KB/database/connectionmonitor.aspx which watches Sql Server connections and reports on ones that have not been properly disposed within a period of time. Wire it into your site, and it will let you know when there is a leak.
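The core idea behind that class is language-agnostic: record when and where each connection was opened, and report any still open past a deadline. A minimal Python sketch of the same technique; it tracks generic handle IDs rather than real SQL connections, and the clock is injectable so it can be tested without waiting:

```python
import time
import traceback

class ConnectionMonitor:
    """Track open handles; report ones not closed within `max_age` seconds,
    along with the stack trace captured where each was opened."""
    def __init__(self, max_age=30.0, clock=time.monotonic):
        self.max_age = max_age
        self.clock = clock
        self.open = {}  # handle id -> (opened_at, creation stack)

    def opened(self, handle_id):
        # Capture the call stack so a leak report points at the culprit code.
        self.open[handle_id] = (self.clock(), traceback.format_stack())

    def closed(self, handle_id):
        self.open.pop(handle_id, None)

    def leaks(self):
        now = self.clock()
        return [h for h, (t, _) in self.open.items()
                if now - t > self.max_age]
```

Wired into a data-access layer (call `opened` when a connection is handed out and `closed` when it is disposed), a periodic `leaks()` check names the undisposed connections, and the stored stack traces identify which of the 300 sites opened them.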
