Minimize downtime in Azure - azure

We are experiencing a very serious unscheduled downtime of our Azure application today for what is now coming up to 9 hours. We reported to Azure support and the ops team is actively trying to fix the problem and I do not doubt that. We managed to get our application running on another "test" hosted service that we have and redirected our CNAME to point at the instance so our customers are happy, but the "main" hosted service is still unavailable.
My own "finger in the air" instinct is that the issue is network related within our data center (West Europe), and indeed, later in the day the service dashboard went red for that region with a message to that effect. (Our application is showing as "Healthy" in the portal, but is unreachable via our cloudapp.net URL. Additionally, threads within our application are logging SQL connection exceptions into our storage account because they cannot contact the DB.)
What is very strange, though, is that the "test" instance I referred to above is in the same data centre, has no issues contacting the DB, and its external endpoint is fully available.
I would like to ask the community if there is anything that I could have done better to avoid this downtime. I obeyed the guidance with respect to having at least 2 role instances per role, yet I still got burned. Should I move to a more reliable data centre? Should I deploy my application to multiple data centres? How would I manage the fact that my SQL Azure DB is in the same data centre?
Any constructive guidance would be appreciated - being a techie, I've never had a more frustrating day being able to do nothing to help fix the issue.

There was an outage in the European data center today with respect to SQL Azure. Some of our clients got hit and had to move to another data center.
If you are running mission critical applications that cannot be down, I would deploy the application into multiple regions. DNS resolution is obviously a weak link right now in Azure, but can be worked around (if you only run a website it can be done very simply using Response.Redirects or similar)
There is now a data synchronization service from Microsoft that will sync multiple SQL Azure databases. Check here. This way, you can have mirror sites up in different regions and keep their SQL Azure databases in sync.
It would also be a good idea to employ a 3rd party monitoring service that detects problems with your deployed instances externally. AzureWatch can notify you, or even deploy new nodes if you choose, when some of the instances turn "Unresponsive".
Hope this helps

I can offer some guidance based on our experience:
Host your application in multiple data centers, complete with SQL Azure databases. You can connect each application to its data center specific SQL Azure database. You can also cache any external assets (images/JS/CSS) on the data center specific Windows Azure machine or leverage Azure Blob Storage. Note: Extra costs will be incurred.
Set up one-way SQL replication between your primary SQL Azure DB and the instance in the other data center. If you want to do bidirectional replication, take a look at the MSDN site for guidance.
Leverage Azure Traffic Manager to route traffic to the data center closest to the user. It has geo-detection capabilities, which will also improve the latency of your application. So you can map http://myapp.com to the internal URL of your data center, and a user in Europe should automatically get directed to the European data center and vice versa for the USA. Note: At the time of writing this post, there is no way to automatically detect a failure and fail over to another data center. Manual steps will be involved once a failure is detected, and failover is all-or-nothing (i.e. you will fail over both the Windows Azure AND SQL Azure instances). If you want micro-level failover, then I suggest putting all your config in the service config file and encrypting the values, so you can edit the connection string to connect instance X to DB Y.
You are all set now. I would create or install a local application to detect the availability of the site. A better solution would be to create a page to check for the availability of application specific components by writing a diagnostic page or web service and then poll it from a local computer.
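The polling idea above can be sketched roughly as follows. This is a minimal sketch, not a finished monitor: the function names `check_endpoint` and `poll_once` are made up for illustration, and it assumes your diagnostic page returns HTTP 200 only when the application-specific components (database, storage) are reachable.

```python
import urllib.error
import urllib.request


def check_endpoint(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return (ok, detail) for one diagnostic URL.

    `opener` is injectable so the check can be exercised without a network.
    """
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
            return (status == 200, f"HTTP {status}")
    except urllib.error.HTTPError as exc:
        # urlopen raises HTTPError for 4xx/5xx responses.
        return (False, f"HTTP {exc.code}")
    except OSError as exc:
        # URLError, timeouts and socket errors are all OSError subclasses.
        return (False, f"unreachable: {exc}")


def poll_once(urls, check=check_endpoint):
    """Check every URL once and return a dict of the ones that failed."""
    failures = {}
    for url in urls:
        ok, detail = check(url)
        if not ok:
            failures[url] = detail
    return failures
```

Run `poll_once` from a machine outside the data center on a schedule (for example, every minute) with one diagnostic URL per deployment, and alert whenever the returned dict is non-empty. Checking from outside matters here because the portal can report "Healthy" while the public endpoint is unreachable.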
HTH

As you're deploying to Azure, you don't have much control over how SQL Server is set up. MS have already set it up so that it is highly available.
Having said that, it seems that MS has been having some issues with SQL Azure over the last few days. We've been told that it only affected "a small number of users". At one point the service dashboard had 5 data centres affected by a problem. I had 3 databases in one of those data centres down twice for about an hour each time, but one database in another affected data centre that had no interruption.
If having a database connection is critical to your app, then the only way in the Azure environment to protect against problems that MS haven't prepared for (this latest technical problem, earthquakes, meteor strikes) is to keep a copy of your SQL data in another data centre. At the moment the most practical way to do this is to use the Sync Framework. There is an ability to copy SQL Azure databases, but this only works within a data centre. With your data located elsewhere, you could then point your app at the other database if the main one becomes unavailable.
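Pointing the app at a second database when the main one is unavailable could look roughly like this. It is only a sketch under assumptions: `connect_with_failover` is a hypothetical helper, and `connect` stands in for whatever function your database driver uses to open a connection from a connection string.

```python
def connect_with_failover(databases, connect):
    """Try each (name, connection_string) pair in order.

    Returns (name, connection) for the first database that accepts a
    connection, or raises ConnectionError if none of them do.
    """
    errors = []
    for name, conn_str in databases:
        try:
            return name, connect(conn_str)
        except Exception as exc:  # a real app would catch its driver's error type
            # Record the database name only; connection strings hold secrets.
            errors.append(f"{name}: {exc}")
    raise ConnectionError("all databases unavailable: " + "; ".join(errors))
```

List the primary first and the copy in the other data centre second. Keep in mind that the secondary only holds data up to its last sync, so writes made against it after a failover have to be reconciled by hand once the primary returns.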
While this looks good on paper though, this may not have helped you with the latest problem as it did affect multiple data centres. If you'd just been making database copies on a regular basis, that might have been enough to get you through. Or not.
(I would have posted this answer on server fault, but I couldn't find the question)

This is just about a programming/architecture issue, but you may also want to ask the question on webmasters.stackexchange.com
You need to find out the root cause before drawing any conclusions.
However, my guess is that one of two things was the problem:
The ISP connectivity differs for the test system and your production system. Either they use different ISPs, or different lines from the same ISP. When I worked in a hosting company we made sure that our IP connectivity went through at least two different ISPs who did not share fibre to our premises (and where we could, they had different physical routes to the building; the homing ability of backhoes when there's a critical piece of fibre to dig up is well proven).
Your datacentre had an issue with some shared production infrastructure. This might be edge routers, firewalls, load balancers, intrusion detection systems, traffic shapers etc. These are often only installed on production systems. Defences here involve understanding the architecture and making sure the provider has a (tested!) DR plan for restoring SOME service when things go pear-shaped. Neatest hack I saw here was persuading an IPS (intrusion prevention system) that its own management servers were malicious, so you couldn't reconfigure it at all.
Just a thought - your DC doesn't host any of the Wikileaks mirrors, or Paypal/Mastercard/Amazon (who are getting DDOS'd by wikileaks supporters at the moment)?

Related

Moving to IaaS on MS Azure

We have an application running fine on premises and plan to move it to IaaS on MS Azure. Do we need to make any changes to it, or will it work as is?
I agree with the above post. You have not said whether you are using Virtual Machines (SQL Server) or are going to use Azure SQL. You will have to make choices about failover and geo-redundancy, cloud services, etc. There are IP restrictions that may affect you (I don't know, since I am not sure what you are moving). More than anything, I always warn people about the cost; it is difficult to understand. Here is an article series I wrote on Azure & SharePoint. You can skip the SharePoint stuff, but the cost/limitation/VM material would still apply.
http://www.matthewjbailey.com/sharepoint-azure-guide/
We've managed a lift-and-shift of an on-premise Windows app into Azure, but I wouldn't say it's been without its pain. The above comments definitely ring true; you need to provide a bit more of an overview of what the current application does so that people can help answer your question.
In my experience, the only stumbling blocks to moving on-premise into Azure are:
Hardware requirements; i.e. if your application requires some specific hardware
Cost: It's not always cheaper to move large systems into Azure
Licensing: Make sure that your existing licensing is compatible with a cloud system which you don't control

web role and sql azure disaster recovery

I'm working on a quite large and critical application. It's been deployed to Azure with 3 web roles and a SQL Azure DB.
In case of disaster, we need to be able to restore both the web roles and SQL Azure to a different data center. Could someone please explain how we can restore the SQL Azure DB and web role(s) to a different data center?
The simple answer is that you take regular backups of your SQL Azure database, which can be restored to a database in another datacenter. Any data written since the last backup will be lost, which is a more difficult problem to resolve; the simplest approach may be to have a hot standby and use SQL Database Data Sync, but it may not be practical for all the data. Web roles are easier: you redeploy them somewhere else, and change the connection strings to the database. You would also have to change the CNAME for your domain, as they will be restored to a different cloudapp.net name.
You did ask for restore, and not failover, right? Performing a failover (where you have a hot standby) is a more difficult problem, particularly as far as data synchronisation is concerned.
I would go back and question 'disaster' and correlate it with known facts. I am not sure of the outage history of Azure in specific data centres, but there have been significant Azure-wide outages (leap year 2012 and the certificate problem this year). The ability to restore to a different Azure datacentre won't help you in these scenarios (although AWS seems to mostly have regional outages). I don't think that a datacenter-specific recovery strategy is necessary on Windows Azure, but you may want to check the history and likelihood of datacenter-specific failures before making a final call. Having a multi-region architecture that distributes load and data across datacentres, and handles live traffic across all of them (say, using Traffic Manager), has many benefits, one side effect being built-in disaster recovery, but it comes at an architectural, development, hosting and bandwidth cost.
Go back and write the business case for your datacenter disaster recovery scenario. You may find that it is not worth it financially, or doesn't solve your real problem.

Azure backup region: website DataBase

I'm deploying an eCommerce site for my customer in Spain. So, my first idea was to deploy it to the Azure Nortwest region.
The problem is that, even with the SLA of 99.99%, it could happen that the whole Azure datacenter breaks down. (The same as the Amazon S3 services that went down for several hours some months ago.)
My question is: how do I protect against this eventual problem? I know that I can change my DNS CNAME to change the website endpoint, but it takes a lot of time to propagate DNS changes, and I must have a very current backup of the database to be able to restore it onto another server.
I know I can use Traffic Manager too, but I still have the problem with the database...
Which is the best approach to solve this problem?
Also, I have some doubts about whether it is reasonable for a medium-sized company to take this into consideration.
Is anyone doing this, and happy with the solution? 8·)
thanks in advance for your help,
luis
SQL Data Sync is a great way to synchronize the data between Azure SQL Databases. It works across data centers as well as regions. Using SQL Data Sync you could create a second database in another data center and synchronize the data between the two databases. There will likely be a window in which you are exposed to data loss, however, as the time between automatic syncs currently can't be lower than five minutes.

How do I make my Windows Azure application resistant to Azure datacenter catastrophic event?

AFAIK Amazon AWS offers so-called "regions" and "availability zones" to mitigate risks of partial or complete datacenter outage. Looks like if I have copies of my application in two "regions" and one "region" goes down my application still can continue working as if nothing happened.
Is there something like that with Windows Azure? How do I address risk of datacenter catastrophic outage with Windows Azure?
Within a single data center, your Windows Azure application has the following benefits:
Going beyond one compute instance, your VMs are divided into fault domains, across different physical areas. This way, even if an entire server rack went down, you'd still have compute running somewhere else.
With Windows Azure Storage and SQL Azure, storage is triple replicated. This is not eventual replication - when a write call returns, at least one replica has been written to.
Ok, that's the easy stuff. What if a data center disappears? Here are the features that will help you build DR into your application:
For SQL Azure, you can set up Data Sync. This facility synchronizes your SQL Azure database with either another SQL Azure database (presumably in another data center), or an on-premises SQL Server database. More info here. Since this feature is still considered a Preview feature, you have to go here to set it up.
For Azure storage (tables, blobs), you'll need to handle replication to a second data center, as there is no built-in facility today. This can be done with, say, a background task that pulls data every hour and copies it to a storage account somewhere else. EDIT: Per Ryan's answer, there's data geo-replication for blobs and tables. HOWEVER: Aside from a mention in this blog post in December, and possibly at PDC, this is not live.
For Compute availability, you can set up Traffic Manager to load-balance across data centers. This feature is currently in CTP - visit the Beta area of the Windows Azure portal to sign up.
Remember that, with DR, whether in the cloud or on-premises, there are additional costs (such as bandwidth between data centers, storage costs for duplicate data in a secondary data center, and Compute instances in additional data centers).
Just like with on-premises environments, DR needs to be carefully thought out and implemented.
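The hourly background copy suggested above for tables and blobs might be sketched like this. To avoid guessing at storage SDK calls, both storage accounts are represented as plain dicts mapping an item name to a `(version, data)` pair; the `replicate` function and that layout are assumptions made for illustration, and a real task would list and copy items through the storage API instead.

```python
def replicate(source, target):
    """Copy new or changed items from source to target.

    Both arguments are dict-like stand-ins for storage accounts, mapping
    item name -> (version, data). Returns the names that were copied.
    """
    copied = []
    for name, (version, data) in source.items():
        existing = target.get(name)
        # Copy only when the item is missing or its version marker differs.
        if existing is None or existing[0] != version:
            target[name] = (version, data)
            copied.append(name)
    return copied
```

Comparing a version marker (an ETag or last-modified timestamp would serve that purpose) keeps each hourly pass from re-copying unchanged data, which matters because bandwidth between data centers is billed. Note that this sketch does not propagate deletions.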
David's answer is pretty good, but one piece is incorrect. For Windows Azure blobs and tables, your data is actually geographically replicated today between sub-regions (e.g. North and South US). This is an async process that has a target of about a 10 min lag or so. This process is also out of your control and is purely for a data center loss. In total, your data is replicated 6 times in 2 different data centers when you use Windows Azure blobs and tables (impressive, no?).
If a data center was lost, they would flip over your DNS for blob and table storage to the other sub-region and your account would appear online again. This is true only for blobs and tables (not queues, not SQL Azure, etc).
So, for a true disaster recovery, you could use Data Sync for SQL Azure and Traffic Manager for compute (assuming you run a hot standby in another sub-region). If a datacenter was lost, Traffic Manager would route to the new sub-region and you would find your data there as well.
The one failure that you didn't account for is the possibility of an error being replicated across data centers. In that scenario, you may want to consider running Azure PaaS as part of the HP Cloud offering in either a load-balanced or failover scenario.

Doubts about Windows Azure Platform Introductory Special

I'm considering joining the Windows Azure Platform Introductory Special, but I'm a little bit afraid of losing money with it. I don't want to develop any fancy large-scale application; I want to join just to learn Azure and do my experiments. What should I be afraid of?
Under data transfer, it says: "Data Transfers (per region)". What does that mean?
Can I set limits to stop the app if it goes over this plan, in order to avoid getting charged?
Can it be "pre pay" instead of "bill pay"?
Would it be enough for a blog?
Any experience so far?
Kind regards.
As ligget pointed out, Azure isn't cost-effective as a host for an application that can be easily deployed to a traditional shared hosting provider. Azure's target market is those who want dedicated resources without the need to micro-manage the infrastructure, plus the capability to easily scale up/down based on demand.
That said, here's the answers to the questions you posted:
Data Transfers are based on bandwidth in and out of the hosting data center. Bandwidth for communication occurring between components (SQL Azure, Windows Azure, Azure Storage, etc.) in the same datacenter is not billable.
Your usage is not currently capped when the free quotas are used up. However, you will receive warning emails when those items approach their usage thresholds.
There is the option to pay for your subscription using a PO, but the minimum threshold for most of these operations is $500/month. So as a hobbyist, it's unlikely you'll want that route.
The introductory special does not provide enough resources for hosting a 24x7 personal blog. That level includes only 25hrs of compute resources. Each hour a single instance of your application is deployed will count against this, even if the application received no traffic. Think of it like renting office space. You still pay rent on the office even if there are no customers there.
All this said, there's still much to be learned with the introductory special. The Azure development tools allow you to work with Windows Azure and Azure storage locally and get a feel for how they work. The introductory special then lets you deploy those solutions so you can see what works and what doesn't (not everything that works locally works hosted).
I would recommend you host your blog somewhere else - it's a waste of resources running it on Azure and you'll find much cheaper options. A recently introduced extra small instance would be a better choice in this case, but AFAIK it is charged separately as of now, e.g. even when you have an MSDN subscription those extra small instance hours do not count towards free Azure hours that come with the subscription.
There is no pre-pay option I know of and it's not possible to stop the app automatically. It'll be running until the deployment is deleted (beware! even if suspended/stopped the deployment will continue to accrue charges). I believe you will be sent a notification shortly before reaching your free hours threshold.
Be aware that when launching more than 1 instance you are charged for every hour of every instance combined. This can happen for example when you have more than one role in your Azure project (1 web role + 1 worker role - a separate instance will be started for each role).
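To make the arithmetic concrete, here is a small sketch of how instance-hours add up. The function names are made up for illustration, and the 25-hour figure is the free compute allowance mentioned above for the introductory special; check your own offer for the actual number.

```python
def instance_hours(instances_per_role, hours_deployed):
    """Total billable compute hours: every instance of every role counts."""
    return sum(instances_per_role.values()) * hours_deployed


def hours_until_quota_exhausted(instances_per_role, free_hours):
    """Wall-clock hours a deployment can stay up before free hours run out."""
    total_instances = sum(instances_per_role.values())
    if total_instances == 0:
        raise ValueError("deployment has no instances")
    return free_hours / total_instances
```

For example, `hours_until_quota_exhausted({"web": 1, "worker": 1}, 25)` gives 12.5, so a project with one web role and one worker role uses up 25 free hours in half a day. A single instance left deployed for a 30-day month accrues `instance_hours({"web": 1}, 24 * 30)`, which is 720 hours, whether or not it receives any traffic.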
Data transfer means your entire data transfer: blobs/Table storage/queues (transfers between your hosted service and storage account inside the same data center are free) + whatever data is transferred in/out of your hosted application, e.g. when somebody visits your pages. When you create storage accounts and hosted services in Azure you will specify a region that will be hosting your account/app. Hosting in Asia is slightly more expensive than in Europe/U.S.
Your best bet would be to contact Microsoft with these questions.

Resources