Sadly, last week Azure migrated one of my databases from the Web tier to the S1 tier. I manually increased the tier to S2 and worked hard to change some things in the system so the DTU usage wouldn't reach 100%.
Now I have a new situation: background jobs run against the database (deletes and so on), and the problem is that they consume 100 percent of the DTU, so my website starts getting errors.
My question is: is there a way to tell SQL, per query or operation, to consume at most X DTU? For example, when I create an index the DTU again rises to 100% and stays there for a long time (I guess it's a big index to build), so again I'm stuck and I cancel the query because I don't want my end users to suffer lag.
None of these issues existed in the Web tier; everything worked smoothly.
That's a very nice idea; unfortunately, limiting a particular query or operation to a capped amount of DTU is not possible. Maybe in future versions they will bring Resource Governor-like tools.
The closest thing I can think of to limit the DTU a query consumes is
OPTION (MAXDOP 1)
A query may go parallel and consume more resources per thread, so limiting MAXDOP helps limit DTU usage, with some caveats.
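If the expensive operation is an index build, as in your case, the same idea can be applied to the DDL itself. A minimal sketch (the table and index names are just placeholders); ONLINE = ON should also keep the table usable while the index builds, if your tier supports it:
-- Build the index on a single scheduler so it cannot saturate the DTU quota,
-- and keep the table readable/writable while it builds (placeholder names).
CREATE NONCLUSTERED INDEX IX_Orders_CustomerId
    ON dbo.Orders (CustomerId)
    WITH (MAXDOP = 1, ONLINE = ON);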
Wondering if any Azure experts out there can give me some suggestions: we have an App Service app running and have noticed that on the first few requests (even with Always On enabled) it can take a very long time to respond.
The chart below is what we observed: it can take up to 2 minutes initially, and afterwards we get more reasonable response times of a few milliseconds/seconds.
How can we make sure that it ALWAYS responds quickly? As a simple test, it is not doing anything processing-intensive, just a few simple DB queries to check whether a key exists.
At the beginning (the very first few minutes) Azure SQL databases run queries slowly due to reduced memory allocation. If you look at the plan of a query that runs slowly at first and then performs well, you will see the plan is the same. On the first runs you may see waits such as MEMORY_ALLOCATION_EXT, IO_QUEUE_LIMIT or PAGEIOLATCH_SH.
After periods of no activity, failovers, or scaling tiers up/down, memory allocation may be reduced and queries may show poor performance for the first few minutes.
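If you want to confirm that this is what you are hitting, you can check the database-level wait statistics during one of the slow periods; a quick sketch (sys.dm_db_wait_stats is the Azure SQL Database counterpart of sys.dm_os_wait_stats):
-- Look for the cold-memory wait types mentioned above while queries are slow.
SELECT wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_db_wait_stats
WHERE wait_type IN ('MEMORY_ALLOCATION_EXT', 'IO_QUEUE_LIMIT', 'PAGEIOLATCH_SH')
ORDER BY wait_time_ms DESC;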
Hope this helps
We currently have an elastic pool of databases in Azure that we would like to scale based on high eDTU usage. There are 30+ databases in the pool and they currently use 100GB of storage (although this is likely to increase).
We were planning on increasing the eDTUs allocated to the pool when we detect high eDTU usage. However, a few posts online have made me question how well this will work. The following quote is taken from the Azure docs (https://learn.microsoft.com/en-us/azure/sql-database/sql-database-resource-limits):
The duration to rescale pool eDTUs can depend on the total amount of storage space used by all databases in the pool. In general, the rescaling latency averages 90 minutes or less per 100 GB.
If I am understanding this correctly, it means that if we want to increase the eDTUs we will have to wait on average 90 minutes per 100 GB. If that's the case, scaling dynamically won't be suitable for us, as waiting 90 minutes for an increase in performance is far too long.
Can anyone confirm whether what I have said above is correct? And are there any alternative recommendations for increasing eDTUs dynamically without having to wait such a long time?
This would also mean that if we wanted to scale on a schedule, e.g. scale up the eDTUs at 8am, we would actually have to initiate the scaling at 6:30am to allow for the estimated 90 minutes of scaling time, if my understanding is correct.
When you scale the pool eDTUs, Azure may have to migrate data (this is a shared database service). That takes time when it is required. I have seen scaling be instant and I have seen it take a long time. I think Microsoft's intent is to offer cost savings via elastic pools, not the ability to quickly change eDTUs.
The following is the answer provided by a Microsoft Azure SQL Database manager:
For rescaling a Basic/Standard pool within the same tier, some service optimizations have occurred so that the rescaling latency is now generally proportional to the number of databases in the pool and independent of their storage size. Typically, the latency is around 30 seconds per database for up to 8 databases in parallel, provided pool utilization isn't too high and there aren't long-running transactions. For example, a Standard pool with 500 databases, regardless of size, can often be rescaled in around 30+ minutes (i.e., ~500 databases * 30 seconds / 8 databases in parallel).
In the case of a Premium pool, the rescaling latency is still proportional to size-of-data.
This Azure SQL Database manager promised to update Azure documentation as soon as they finish implementing more improvements.
Thank you for your patience waiting for this answer.
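If you end up automating the rescale, it can also help to confirm from T-SQL when a change has actually taken effect; a small sketch, run against the logical server's master database (sys.database_service_objectives also shows which elastic pool each database belongs to):
-- edition / service_objective change once a rescale has completed.
SELECT d.name, so.edition, so.service_objective, so.elastic_pool_name
FROM sys.databases AS d
JOIN sys.database_service_objectives AS so
    ON d.database_id = so.database_id;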
We run a web service that gets 6k+ requests per minute during peak hours and about 3k requests per minute during off hours. Lots of data feeds compiled from 3rd-party web services and custom-generated images. Our service and code are mature; we've been running this for years. A lot of work by good developers has gone into our service's code base.
We're migrating to Azure, and we're seeing some serious problems. For one, we are seeing our Premium P1 SQL Azure database routinely become unavailable for 1-2 full minutes. I'm sorry, but this seems absurd. How are we supposed to run a web service with requests waiting 2 minutes for access to our database? This is occurring several times a day. It occurs less often since switching from the Standard tier to the Premium tier, but we're nowhere near our DB's DTU capacity and we're getting throttled hard far too often.
Our SQL Azure DB is Premium P1, and our load according to the new Azure portal is usually under 20%, with a couple of spikes each hour reaching 50-75%. Of course, we can't even trust Azure's portal metrics. The old portal gives us no data for our SQL database, and the new portal is very obviously wrong at times (our DB was not down for half an hour, as the graph suggests, but it was down for more than 2 full minutes).
Azure reports the size of our DB as a little over 12 GB (in our own SQL Server installation, the DB is under 1 GB; that's another of many questions: why is it reported as 12 GB on Azure?). We've done plenty of tuning over the years and have good indices.
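(For what it's worth, a query along these lines against sys.dm_db_partition_stats should show the space actually used by rows and indexes, as opposed to whatever Azure is reporting as allocated:)
-- Space used by all rows and indexes, in MB (reserved pages are 8 KB each).
SELECT SUM(reserved_page_count) * 8.0 / 1024 AS used_mb
FROM sys.dm_db_partition_stats;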
Our service runs on two D4 cloud service instances. Our DB libraries all implement retry logic, waiting 2, 4, 8, 16, 32, and then 48 seconds before failing completely. Controllers are all async, and most of our various external service calls are async. DB access is still largely synchronous, but our heaviest queries are async. We heavily utilize in-memory and Redis caching. The most frequent use of our DB is 1-3 records inserted for each request (those tables are queried only once every 10 minutes to check error levels).
Aside from batching up those request-logging inserts, there's really not much more give in our application's DB access code. We're nowhere near our DTU allocation on this database, and the server our DB is on still has something like 2000 DTUs available to be allocated. If we have to live with 1+ minute periods of unavailability every day, we're going to abandon Azure.
Is this the best we get?
Querying stats in the database seems to show we are nowhere near our resource limits. Also, on the Premium tier we should be guaranteed our DTU level second by second. But, again, we go more than an entire minute without being able to get a database connection. What is going on?
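(The sort of check I mean is a query along these lines against sys.dm_db_resource_stats, which keeps roughly an hour of resource usage samples at 15-second intervals:)
-- Recent resource usage relative to the limits of the current tier.
SELECT end_time,
       avg_cpu_percent,
       avg_data_io_percent,
       avg_log_write_percent,
       max_worker_percent
FROM sys.dm_db_resource_stats
ORDER BY end_time DESC;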
I can also say that after we experience one of these longer delays, our stats seem to reset. The image above was taken a couple of minutes before a 1-minute-plus delay, and this one a couple of minutes after.
We have been in contact with Azure's technical staff and they confirm this is a bug in their platform that is causing our database to go through failover multiple times a day. They stated they will be deploying fixes starting this week and continuing over the next month.
Frankly, we're having trouble understanding how anyone can reliably run a web service on Azure. Our pool of Websites randomly goes down for a few minutes a few times a month, taking our public sites down. If our cloud service returns too many 500 responses, something in front of it cuts off all traffic and returns 502s (totally undocumented behavior as far as we can tell). SQL Azure has very limited performance and obviously isn't ready for prime time.
My application relies heavily on a queue in Windows Azure Storage (not Service Bus). Until two days ago it worked like a charm, but all of a sudden my worker role is no longer able to process all the items in the queue. I've added several counters and from that data deduced that deleting items from the queue is the bottleneck. For example, deleting a single item from the queue can take up to 1 second!
On the SO post "How to achive more 10 inserts per second with azure storage tables" and on the MSDN blog
http://blogs.msdn.com/b/jnak/archive/2010/01/22/windows-azure-instances-storage-limits.aspx I found some information on how to speed up communication with the queue, but those posts only look at the insertion of new items. So far, I haven't been able to find anything on why deleting queue items should be slow. So the questions are:
(1) Does anyone have a general idea why deletion may suddenly be slow?
(2) On Azure's status pages (https://azure.microsoft.com/en-us/status/#history) there is no mention of any service disruption in West Europe (which is where my stuff is situated); can I rely on the service pages?
(3) In the same storage account, I have a lot of data in blobs and tables. Could that amount of data interfere with the ability to delete items from the queue? Also, does anyone know what happens if you're pushing the data limit of 2 TB?
1) Sorry, no. Not a general one.
2) Can you rely on the service pages? They certainly will give you information, but there is always a lag between the time an issue occurs and when it shows up on the status board. They are getting better at automating the updates, and in the management portal you are starting to see notifications when your particular deployments might be affected. With that said, it is not unheard of for small issues to crop up from time to time that never show up on the board because they don't break the SLA or are resolved extremely quickly. It's good you checked this, though; it's usually a good first step.
3) In general, no: the amount of data you have within a storage account should NOT affect your throughput; however, there is a limit to the throughput you'll get on a storage account (regardless of the amount of data stored). You can read about the Storage Scalability and Performance targets, but the throughput target is up to 20,000 entities or messages per second for all access to a storage account. If you have a LOT of applications or systems attempting to access data in this same storage account, you might see some throttling or failures as you approach that limit. Note that, as you saw in the posts on improving insert throughput, these are performance targets, and how your code is written and the configuration you use have a drastic effect on them. The data limit for a storage account (everything in it) is 500 TB, not 2 TB. I believe that once you hit the actual storage limit all writes will simply fail until more space is available (I've never come close to it, so I'm not 100% sure about that).
Throughput is also limited at the partition level, and for a queue that's a target of up to 2,000 messages per second, which you clearly aren't getting at all. Since you have only a single worker role, I'll take a guess that you don't have that many producers of the messages either, at least not enough to get near 2,000 messages per second.
I'd turn on Storage Analytics to see if you are being throttled, and check the AverageE2ELatency and AverageServerLatency values (as Thomas also suggested in his answer) recorded in the $MetricsMinutePrimaryTransactionQueue table that analytics creates. This will give you an idea of trends over time and may help determine whether it is a latency issue between the worker roles and the storage system.
The reason I asked about the size of the VM for the worker role is that there is an (unpublished) amount of throughput per VM based on its size. An XS VM gets much less of the total throughput on the NIC than larger sizes. You can sometimes get more than you expect across the NIC, but only if the other deployments on the physical machine aren't using their portion of that bandwidth at the time. This can often lead to varying performance for network-bound work when testing. I'd still expect much better throughput than what you are seeing, though.
There is a network between you and Azure Storage, which can add latency.
Sudden spikes (e.g. from 20 ms to 2 s) can happen often, so you need to deal with them in your code.
To pinpoint the problem further (e.g. client issues, network errors, etc.), you can turn on Storage Analytics to see where the problem lies. There you can also see whether the end-to-end latency is too high or whether the server latency alone is the limiting factor. The former usually points to network issues, the latter to something being wrong with the queue itself.
Usually these latency issues are transient (just temporary), and there is no need to announce them as a service disruption, because they aren't one. If performance is consistently bad, you should open a support ticket.
I have an Azure worker role that inserts a batch of records into a table. Yesterday, it took at most 5 minutes to insert the records, but today it has been taking up to a couple of hours. I suspect that the process is being throttled, but I don't get any exceptions. Does SQL Azure always return an error if you are being throttled, or is there another way to detect if you are being throttled?
In the case of CPU throttling, SQL Database will not throw an error but will slow down the operation. At this time there is no mechanism to determine whether this form of throttling is taking place, other than possibly looking at the query stats and seeing that the work is proceeding slowly (if your CPU time is lower than usual). Check this link for details about this behavior: the performance and elasticity guide (look for "Performance Thresholds Monitored by Engine Throttling").
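One rough way to see this from the database side is to compare average CPU time against average elapsed time per query in sys.dm_exec_query_stats; a large gap suggests the work is waiting rather than burning CPU. Just a sketch, not an official detection mechanism:
-- Average CPU vs. elapsed time per cached query plan; times are in microseconds.
SELECT TOP (20)
       qs.execution_count,
       qs.total_worker_time  / qs.execution_count AS avg_cpu_us,
       qs.total_elapsed_time / qs.execution_count AS avg_elapsed_us,
       SUBSTRING(st.text, 1, 100)                 AS query_start
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY avg_elapsed_us DESC;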
One of the newer capabilities is the ability to monitor the number of outstanding requests a SQL Azure database has. You can do this with the following query:
SELECT COUNT(*) FROM sys.dm_exec_requests;
As you will see in this documentation, reaching the limit of worker threads is a key reason for being throttled. It is also documented there that as you approach 180 worker threads you can expect to be throttled.
This is one of the things used in the Cotega monitoring service for SQL Azure to detect issues. [Disclaimer: I work on this service]
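On newer service tiers you can also watch how close you are getting to the worker limit as a percentage; a minimal sketch, assuming sys.dm_db_resource_stats is available on your database:
-- Peak worker and session usage over the retained 15-second samples (about an hour).
SELECT MAX(max_worker_percent)  AS peak_worker_percent,
       MAX(max_session_percent) AS peak_session_percent
FROM sys.dm_db_resource_stats;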