I have a .NET web app hosted on Azure on the S1 Production pricing tier (1 core, 1.75 GB memory, A-series compute). What's strange is that I'm going through extended periods of poor performance. Usually my average response time is around the 1.4 s mark. Not good by any stretch, but it's something I can work with. However, I'm experiencing extended periods where the response time shoots up to 5 s or more. These periods last for days, up to a week, before returning to normal levels. My knowledge of Azure is pretty limited, but I can't find anything that would explain this.
(chart: average response time over the last 30 days)
You might first want to identify whether this is an issue with the web app itself or simply a trend in app usage (i.e. it receives its maximum hits during specific weeks of the month).
There are several areas you might want to look at for further diagnosis. A few are:
Look at the number of requests during the time the website is slow. This is shown as a web part on the web app's overview page.
Check "Diagnose and solve problems". It is a self-service diagnostic and troubleshooting experience that helps you resolve issues with your web app.
If you have a considerable user base and it's a production environment, 1 core + 1.75 GB of RAM can simply be insufficient to bear the load. If you determine that this is down to a usage trend from your users, you can plan to scale the application out/up to meet the demand (a scripted-scaling sketch follows).
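If the slowdown does turn out to be load-driven, the scale operation can be scripted. Below is a minimal sketch, assuming the @azure/arm-appservice and @azure/identity Node SDKs; the subscription ID, resource group, and plan name are placeholders, not values from your setup.

    import { WebSiteManagementClient } from "@azure/arm-appservice";
    import { DefaultAzureCredential } from "@azure/identity";

    async function scaleUpAndOut(): Promise<void> {
      const client = new WebSiteManagementClient(
        new DefaultAzureCredential(),
        "<subscription-id>" // placeholder
      );
      // One update covers both scaling up (bigger SKU) and out (more instances).
      await client.appServicePlans.beginCreateOrUpdateAndWait(
        "my-resource-group",   // placeholder
        "my-app-service-plan", // placeholder
        {
          location: "westus",
          sku: { name: "S2", tier: "Standard", capacity: 2 }, // S1 -> S2, 1 -> 2 workers
        }
      );
    }

    scaleUpAndOut().catch(console.error);

The same change can of course be made by hand in the portal; scripting it just makes it repeatable once you know which tier handles the peaks.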
Related
I'm about to start work on an API that will go from 0 RPS to a couple hundred thousand HTTP RPS almost instantly and run at that rate for ~2 minutes. All processing of those 30 million requests must finish by the end of that 2-minute window. This would happen 7 times a WEEK.
Going serverless with Azure Functions in Consumption plan hosting mode sounds appealing. This document describes a scale controller that coordinates app instances, but it doesn't really discuss what I can expect from it for HTTP triggers. I can't find any information saying the scale controller can respond in the time frame I'd need.
The best information I could find was a report saying it took nearly 8 minutes to scale up in the author's tests.
Is this a bad use case for Azure Functions in consumption mode?
Obviously, spinning up a testing harness that is capable of issuing 30 million requests within 2 minutes is an undertaking of its own, and an expensive one. I'd like to learn from others who have already done so.
Based on my experience, this scenario is not handled well by the Consumption plan. It can scale out to many instances, but not rapidly; 2 minutes is far too short a window to rely on.
I was mostly working with queues rather than HTTP, but I saw delays of up to 40 minutes caused by the slow pace of scaling out.
If you can predict which 2 minutes are going to be heavily loaded, your best bet could be to provision the capacity ahead of time with a script (or another Function); a sketch follows.
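A minimal sketch of such a script, assuming the Functions app sits on a dedicated/Premium plan you control (Consumption instances cannot be pre-provisioned directly) and assuming the @azure/arm-appservice and @azure/identity SDKs; every name below is a placeholder:

    import { WebSiteManagementClient } from "@azure/arm-appservice";
    import { DefaultAzureCredential } from "@azure/identity";

    async function preProvision(workers: number): Promise<void> {
      const client = new WebSiteManagementClient(
        new DefaultAzureCredential(),
        "<subscription-id>" // placeholder
      );
      // Bump the worker count on the plan hosting the Functions app.
      await client.appServicePlans.beginCreateOrUpdateAndWait(
        "my-resource-group", // placeholder
        "my-functions-plan", // placeholder
        {
          location: "westus",
          sku: { name: "EP2", tier: "ElasticPremium", capacity: workers },
        }
      );
    }

    // Run this a comfortable margin before the known 2-minute burst,
    // then scale back down afterwards to control cost.
    preProvision(20).catch(console.error);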
We run a web service that gets 6k+ requests per minute during peak hours and about 3k requests per minute during off hours: lots of data feeds compiled from 3rd-party web services, plus custom-generated images. Our service and code are mature; we've been running this for years, and a lot of work by good developers has gone into the code base.
We're migrating to Azure, and we're seeing some serious problems. For one, our Premium P1 SQL Azure database routinely becomes unavailable for 1-2 full minutes. I'm sorry, but this seems absurd. How are we supposed to run a web service with requests waiting 2 minutes for access to our database? This occurs several times a day. It happens less often since switching from the Standard tier to Premium, but we're nowhere near our DB's DTU capacity and we're still getting throttled hard far too often.
Our SQL Azure DB is Premium P1, and our load according to the new Azure portal is usually under 20%, with a couple of spikes each hour reaching 50-75%. Of course, we can't even trust the portal's metrics: the old portal gives us no data for our SQL database, and the new portal is very obviously wrong at times (our DB was not down for half an hour, as the graph suggests, but it was down for more than 2 full minutes):
Azure reports the size of our DB at a little over 12 GB (in our own SQL Server installation the DB is under 1 GB; that's another of many questions: why is it reported as 12 GB on Azure?). We've done plenty of tuning over the years and have good indexes.
Our service runs on two D4 cloud service instances. Our DB libraries all implement retry logic, waiting 2, 4, 8, 16, 32, and then 48 seconds before failing completely. Controllers are all async, and most of our external service calls are async. DB access is still largely synchronous, but our heaviest queries are async. We make heavy use of in-memory and Redis caching. The most frequent use of our DB is the 1-3 records inserted for each request (those tables are queried only once every 10 minutes to check error levels).
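For illustration, here is a minimal sketch of that retry schedule; the service in question is .NET, but the pattern is the same in any language (TypeScript here for brevity):

    // Wait 2, 4, 8, 16, 32, then 48 seconds between attempts, then give up.
    const RETRY_DELAYS_MS = [2, 4, 8, 16, 32, 48].map((s) => s * 1000);

    async function withRetries<T>(operation: () => Promise<T>): Promise<T> {
      for (let attempt = 0; ; attempt++) {
        try {
          return await operation();
        } catch (err) {
          if (attempt >= RETRY_DELAYS_MS.length) throw err; // fail completely
          await new Promise((r) => setTimeout(r, RETRY_DELAYS_MS[attempt]));
        }
      }
    }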
Aside from batching up those request-logging inserts, there's really not much more give in our application's DB access code. We're nowhere near our DTU allocation on this database, and the server the DB sits on still has around 2,000 DTUs available to be allocated. If we have to live with 1+ minute periods of unavailability every day, we're going to abandon Azure.
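For what it's worth, a minimal sketch of batching those per-request log inserts; bulkInsertLogRows is a hypothetical stand-in for the real database call:

    // Buffer log rows in memory on the hot path; flush them as one batch on a timer.
    type LogRow = { requestId: string; level: string; message: string; at: Date };

    const buffer: LogRow[] = [];

    export function logRequest(row: LogRow): void {
      buffer.push(row); // cheap in-memory append per request
    }

    async function bulkInsertLogRows(rows: LogRow[]): Promise<void> {
      // hypothetical: a single multi-row INSERT through your DB driver
      console.log(`flushed ${rows.length} log rows`);
    }

    // One round-trip every 10 seconds instead of 1-3 inserts per request.
    setInterval(async () => {
      if (buffer.length === 0) return;
      const batch = buffer.splice(0, buffer.length);
      try {
        await bulkInsertLogRows(batch);
      } catch {
        buffer.unshift(...batch); // put the rows back; retry on the next tick
      }
    }, 10_000);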
Is this the best we get?
Querying stats inside the database also seems to show we are nowhere near our resource limits (a query sketch follows). Moreover, on the Premium tier our DTU level is supposed to be guaranteed second by second. And yet we go more than a solid minute without being able to get a database connection. What is going on?
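For reference, a minimal sketch of the kind of stats query meant above, assuming the mssql npm package; Azure SQL exposes sys.dm_db_resource_stats with one row per roughly 15 seconds covering about the last hour:

    import sql from "mssql";

    async function recentResourceStats(): Promise<void> {
      // Connection string read from the environment.
      const pool = await sql.connect(process.env.SQLAZURE_CONN_STRING!);
      const result = await pool.request().query(`
        SELECT TOP 20 end_time,
               avg_cpu_percent,
               avg_data_io_percent,
               avg_log_write_percent
        FROM sys.dm_db_resource_stats
        ORDER BY end_time DESC`);
      console.table(result.recordset);
      await pool.close();
    }

    recentResourceStats().catch(console.error);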
I can also say that after we experience one of these longer delays, our stats seem to reset. The image above was taken a couple of minutes before a 1-minute-plus delay, and this one a couple of minutes after:
We have been in contact with Azure's technical staff, and they confirm this is a bug in their platform that is causing our database to fail over multiple times a day. They stated that they will start deploying fixes this week and continue over the next month.
Frankly, we're having trouble understanding how anyone can reliably run a web service on Azure. Our pool of Websites randomly goes down for a few minutes several times a month, taking our public sites down. If our cloud service returns too many 500 responses, something in front of it cuts off all traffic and returns 502s (totally undocumented behavior, as far as we can tell). SQL Azure has very limited performance and obviously isn't ready for prime time.
My Azure websites are down a lot: four outages (30 minutes to 3 hours) in the past 30 days. I only use one Small Standard website and one Web SQL DB in US West, so I can't expect more than 99.5%. This week (a few days ago and currently), 503 errors were/are the problem, but I have also experienced substantial DB downtime on other occasions.
My question is: what can I do (hopefully without too much additional cost and effort) to improve stability? Which measures have other Azure users tried? Would they have prevented the 3-hour downtime last Monday?
There are (at least) three things you can do:
Scale your website so that you have 2 instances running (How to Scale Websites);
Deploy another copy of the website in a different region and use Traffic Manager to load-balance between them (Traffic Manager Overview); a toy probe illustrating the idea follows this list;
Create another instance of the database and keep the two in sync (Getting Started with Azure SQL Data Sync).
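To make the failover idea concrete, here is an illustrative, toy client-side probe; Traffic Manager automates exactly this at the DNS level, and the endpoint URLs below are placeholders:

    const ENDPOINTS = [
      "https://mysite-westus.azurewebsites.net", // primary (placeholder)
      "https://mysite-eastus.azurewebsites.net", // secondary (placeholder)
    ];

    async function firstHealthyEndpoint(): Promise<string> {
      for (const url of ENDPOINTS) {
        try {
          const res = await fetch(url, { signal: AbortSignal.timeout(3000) });
          if (res.ok) return url; // send traffic to the first healthy endpoint
        } catch {
          // probe failed or timed out -- try the next endpoint
        }
      }
      throw new Error("no healthy endpoint");
    }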
I'd like to point out that with Websites you get 99.9% availability for sites running in Basic and Standard mode, regardless of the number of instances deployed. You can view the SLA document here: http://go.microsoft.com/fwlink/?linkid=301329&clcid=0x409
Regarding the 503 errors you experienced: this was an intermittent issue, see http://azure.microsoft.com/en-us/status/#history
First, a little bit about my background: I have been programming for some time (10 years at this point) and am fairly competent when it comes to coding up ideas. I started web-application programming just over a year ago, and thankfully discovered Node.js, which made web-app creation feel a lot more like traditional programming. I now have a Node.js app, developed over some time, running in production on the web. My main confusion stems from the fact that I am very new to the world of web development and don't really know what's important and what isn't when it comes to monitoring my application.
I am using a Joyent SmartMachine, and the analytics options they provide are a little overwhelming. There are so many different options and configurations, and I have no clue what purpose each metric really serves. For the questions below, I'd appreciate any answer, whether it's specific to Joyent's Cloud Analytics or completely general.
QUESTION ONE
Right now, my main concern is to figure out how my application is utilizing the server it runs on. I want to know whether my application has the right amount of resources allocated to it. Does the number of requests it receives make the current server overkill, or does it warrant extra resources? What analytics are important to look at for a Node.js app for that purpose? (I'm using both MongoDB and Redis on separate servers, if that makes a difference.)
QUESTION TWO
What other statistics are generally really important to look at when managing a server that's in production? I'm used to programs that run once to do something specific (e.g. a raytracer that finishes running once it has computed an image), as opposed to web-apps which are continuously running and interacting with many clients. I'm sure there are many things that are obvious to long-time server administrators that aren't to newbies like me.
QUESTION THREE
What's important to look at when dealing with NodeJS specifically? What are statistics/analytics that become particularly critical when dealing with the single-threaded event loop of NodeJS versus more standard server systems?
I have other questions about how databases play into the equation, but I think this is enough for now...
We have been running Node.js in production for nearly a year, starting with the 0.4 series and currently on 0.8. The web app is Express 2- and 3-based, with Mongo, Redis, and Memcached.
A few facts:
Node cannot handle a large V8 heap; when it grows past about 200 MB you will start seeing increased CPU usage.
Node always seems to leak memory, or at least to grow a large heap without actually using it. I suspect memory fragmentation, as V8 profiling and Valgrind show no leaks in JS space or in the resident heap. Early 0.8 was awful in this respect; RSS could be 1 GB with a 50 MB heap (a poller sketch that makes this visible follows this list).
Hanging requests are hard to track. We wrote our own middleware to monitor these, especially as our app is long-poll based.
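A minimal sketch of a poller that makes the large-RSS/small-heap pattern visible, using Node's built-in process.memoryUsage():

    // Print memory stats once per minute: RSS vs. V8 heap size and usage.
    const mb = (bytes: number): string => (bytes / 1024 / 1024).toFixed(1);

    setInterval(() => {
      const { rss, heapTotal, heapUsed } = process.memoryUsage();
      console.log(`[mem] rss=${mb(rss)}MB heapTotal=${mb(heapTotal)}MB heapUsed=${mb(heapUsed)}MB`);
    }, 60_000);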
My suggestions:
Use multiple instances per machine, at least one per CPU. Balance with HAProxy, Nginx, or similar, with session affinity.
Write middleware to report hung connections, i.e. ones the code never responded to or whose latency exceeded a threshold (see the sketch after this list).
Restart instances often, at least weekly.
Write a poller that prints out memory stats via the process module once per minute.
Use supervisord and Fabric for easy process management.
Monitor CPU and the reported memory stats, and restart on a threshold.
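A minimal sketch of the hung-connection middleware suggested above, written for Express; the 30-second threshold and the console.warn reporting sink are placeholders to adapt:

    import type { Request, Response, NextFunction } from "express";

    const LATENCY_THRESHOLD_MS = 30_000; // tune for your long-poll timeout

    export function hangReporter(req: Request, res: Response, next: NextFunction): void {
      const started = Date.now();

      // Fires only if no response has been sent within the threshold.
      const timer = setTimeout(() => {
        console.warn(`[hang] ${req.method} ${req.url}: no response after ${LATENCY_THRESHOLD_MS} ms`);
      }, LATENCY_THRESHOLD_MS);

      res.on("finish", () => {
        clearTimeout(timer);
        const elapsed = Date.now() - started;
        if (elapsed > LATENCY_THRESHOLD_MS) {
          console.warn(`[slow] ${req.method} ${req.url}: ${elapsed} ms`);
        }
      });

      next();
    }

Register it with app.use(hangReporter) before your routes so every request passes through it.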
Whatever the type of web app, Node.js or otherwise, load testing will tell you whether your application has the right amount of server resources. A good website I recently found for this is Load Impact.
The real question to answer is WHEN the load time begins to increase as the number of concurrent users grows. A tipping point is reached at a certain number of concurrent users, after which server performance starts to degrade. So load test according to how many users you expect to reach your website in the near future.
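One way to find that tipping point, sketched here assuming the autocannon npm package (the target URL is a placeholder): ramp the connection count and watch where latency starts to degrade.

    import autocannon from "autocannon";

    async function findTippingPoint(): Promise<void> {
      for (const connections of [10, 50, 100, 250, 500]) {
        const result = await autocannon({
          url: "https://example.com", // placeholder target
          connections,                // simulated concurrent users
          duration: 30,               // seconds per step
        });
        console.log(`${connections} conns: avg=${result.latency.average} ms, p99=${result.latency.p99} ms`);
      }
    }

    findTippingPoint().catch(console.error);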
How can you estimate the number of users to expect?
Installing Google Analytics or another analytics package on your pages is a must! This way you can see how many daily users visit your website and how your visits grow from month to month, which helps in predicting future expected visits and therefore the expected load on your server.
Even if I know the number of users, how can I estimate actual load?
The answer is in the F12 developer tools available in all browsers. Open your website in any browser and press F12 (or Ctrl+Shift+I in Opera), which should open the browser's developer tools. On Firefox make sure you have Firebug installed; on Chrome and Internet Explorer it should work out of the box. Go to the Net or Network tab and refresh your page. This will show you the number of HTTP requests and the bandwidth usage per page load.
So the formula to work out daily server load is simple:
Number of HTTP requests per page load × average number of page loads per user per day × expected number of daily users = total HTTP requests to the server per day
And...
Number of MB transferred per page load × average number of page loads per user per day × expected number of daily users = total bandwidth required per day
I've always found it easier to calculate these figures on a daily basis and then extrapolate it to weeks and months.
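The two formulas above as a tiny calculator, with made-up example numbers:

    function dailyLoad(
      requestsPerPage: number,
      mbPerPage: number,
      pagesPerUserPerDay: number,
      dailyUsers: number
    ) {
      return {
        totalRequestsPerDay: requestsPerPage * pagesPerUserPerDay * dailyUsers,
        totalBandwidthMBPerDay: mbPerPage * pagesPerUserPerDay * dailyUsers,
      };
    }

    // e.g. 40 requests and 1.5 MB per page load, 5 page loads per user, 2,000 daily users:
    console.log(dailyLoad(40, 1.5, 5, 2000));
    // -> { totalRequestsPerDay: 400000, totalBandwidthMBPerDay: 15000 }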
Node.js is single-threaded, so you should definitely start a process for every CPU your machine has. The cluster module is by far the best way to do this, and it has the added benefit of being able to restart dead workers and detect unresponsive ones.
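A minimal sketch of that setup using the built-in cluster module:

    import cluster from "node:cluster";
    import http from "node:http";
    import os from "node:os";

    if (cluster.isPrimary) {
      // One worker per CPU.
      for (let i = 0; i < os.cpus().length; i++) cluster.fork();

      cluster.on("exit", (worker, code) => {
        console.warn(`worker ${worker.process.pid} died (code ${code}); restarting`);
        cluster.fork(); // replace the dead worker
      });
    } else {
      http
        .createServer((_req, res) => res.end(`handled by pid ${process.pid}\n`))
        .listen(3000);
    }

The primary process only supervises; the workers share the listening socket, so no extra balancer is needed between processes on the same machine.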
You also want to load test until your requests start timing out or exceed what you consider a reasonable response time. This will give you a good idea of the upper limit your server can handle. Blitz is one of many options to have a look at.
I have never used Joyent's statistics, but NodeFly and their node-nodefly-gcinfo module are great tools for monitoring Node processes.
I'm an MSDN subscriber, and I'm looking at Azure as a possible platform for a new website that will test the waters for a new service. The website is expecting low to very low traffic at launch. I've heard that this level of traffic can be expensive on Azure, but since they have this MSDN offer, I thought I'd finally take a look.
The offer includes "750 small compute hours per month". From the reading I've done, this seems to mean that if I purchase nothing beyond what's given (although the subscription itself costs thousands of dollars, of course), an entire month would be covered: since 24 (hours) × 31 (max days in a month) = 744, I'm still below my allotted 750 hours for the month.
Am I missing something in this simple equation? Are there other aspects that could cause the site to be "turned off" temporarily that I should consider?
Yes, you can indeed run a small instance for a whole month. Or you can have 2 extra-small instances instead (having 2 instances means you're covered by the SLA).
There are 2 other things you need to consider:
Depending on your subscription you can have a maximum of 45 GB of storage (blob/table/queue). If you use Virtual Machines, be aware that the system disk (and any additional data disks) are persisted as blobs, so make sure not to hit the limit here.
Other limits apply as well, but the most important one besides storage is the outbound data transfer limit, which is also quite restrictive (max 35 GB out).
If you're expecting very low traffic, have you considered Windows Azure Web Sites? You get 10 of those for free for 12 months. The free ones run on shared instances, but they are perfect for hosting the first low-traffic version of your app.