Azure Traffic Manager erratic load balancing causing issues

I have an Azure Traffic Manager configured to route traffic across two data centres based on performance (latency). The two DCs are replicas of each other and are engineered this way so that our global customers get good performance no matter where they are connecting from.
The application tiers do not hold state, and the data tiers are set up using SQL merge replication on a one-minute timer to keep the DBs in sync, so as to provide service continuity in the event of a data centre failover.
The issue I have found is that the Traffic Manager's routing is slightly erratic. I have observed a user registering under one data centre only to find the login has been routed to the other one; the SQL replication hadn't synced at that point, so the second DC wasn't aware the user existed, even though the user both registered and logged in from the same location! The DCs are in West US and Southeast Asia.
I'm looking at a few options to fix this. Solution A is to silo each user's data to a specific data centre, so that whichever DC the user registers against is used thereafter. I wouldn't have syncing issues, but I would lose the continuity advantage that the SQL replication provides.
Solution B is to use a different, more predictable global load balancer. But first I want some opinions, and to see whether I am doing something wrong or my architecture is flawed.
Thanks for any advice.
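For concreteness, here is a rough sketch of what Solution A might look like at the application layer, using Flask purely for illustration; the cookie name, DC identifiers and URLs are all placeholders, not anything from the real system:

# Rough sketch of "Solution A": pin each user to the data centre they
# registered in via a cookie, so logins never race ahead of SQL merge
# replication. All names and URLs below are made-up examples.
from flask import Flask, request, redirect, make_response

app = Flask(__name__)

LOCAL_DC = "westus"  # set per deployment, e.g. via app settings
DC_BASE_URLS = {
    "westus": "https://app-westus.example.com",
    "southeastasia": "https://app-sea.example.com",
}

@app.route("/register", methods=["POST"])
def register():
    # ... create the user in the local database here ...
    resp = make_response("registered", 201)
    resp.set_cookie("home_dc", LOCAL_DC, secure=True, httponly=True)
    return resp

@app.route("/login", methods=["POST"])
def login():
    home_dc = request.cookies.get("home_dc")
    if home_dc in DC_BASE_URLS and home_dc != LOCAL_DC:
        # 307 preserves the POST so the credentials follow the redirect.
        return redirect(DC_BASE_URLS[home_dc] + "/login", code=307)
    # ... authenticate against the local database here ...
    return "logged in"

The real application is ASP.NET, of course; the point is only that the affinity decision moves out of the load balancer and into the application.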

My solution also had challenges with Traffic Manager, although slightly different to yours. Traffic Manager is a great-value solution if it can work for you. As far as I am aware, no configuration in Traffic Manager allows it to be aware of sessions, so it is blinkered to its configured routing method, which in your case is performance. That means it only seems erratic because you are expecting it to use sessions to stay persistent to an endpoint (subject to that endpoint being available), which it never will.
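One way to see this behaviour for yourself is to resolve the profile's DNS name a few times; each lookup is answered independently according to the routing method and the resolver's location, so nothing ties a user's second visit to the endpoint that served the first. A quick stdlib sketch (the hostname below is a placeholder, not a real profile):

# Resolve a Traffic Manager profile name a few times and print which
# endpoint IPs come back. Each DNS answer is independent, with no
# session state, so consecutive lookups can land on different DCs.
import socket
import time

PROFILE_HOST = "myapp.trafficmanager.net"  # placeholder

for attempt in range(5):
    try:
        ips = sorted({info[4][0] for info in socket.getaddrinfo(PROFILE_HOST, 443)})
        print(f"lookup {attempt + 1}: {ips}")
    except socket.gaierror as exc:
        print(f"lookup {attempt + 1}: resolution failed ({exc})")
    time.sleep(2)  # wait a little so cached answers can expire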
In terms of your solution, it is very much enterprise-grade. Moving backwards to Solution A probably doesn't fit the requirement, given what you went to the effort of building. Solution B brings many more features that Traffic Manager lacks, and one of them will resolve your issue. For other reasons I am looking at
http://kemptechnologies.com/uk/server-load-balancing-appliances/virtual-loadbalancer/loadmaster-azure
It is designed for Azure and is available as a pre-installed VM. There are others available, but this has been my choice, and it is what I would use if I were in your position and wanted to keep the level of resilience you currently have.
Hope this helps.

Related

Cloud terminology: elasticity vs. scalability, and compute vs. networking

First of all, I would like to say a big thank you to everyone who is reading my questions; I really appreciate your time and hope to be able to contribute back. Second, I did see part of my question asked in another thread, but it does not answer it, and my questions have a slightly different angle. So here it goes:
Doesn't elasticity already include scalability? I see scalability and elasticity presented as two separate features of the cloud in service promotions. Is there a technical difference, or is it just a marketing play on terminology?
I have a similar confusion about compute and networking: doesn't compute power already include networking? I have seen them briefly presented as two separate advantages of cloud services.
I will give it a try :) But it will largely be my own understanding rather than citations from provider documentation.
Elasticity vs. Scalability
I interpret elasticity as the capability to react to more or less daily variation in resource needs. Unlike reserved instances or your own server hardware "in the basement", the cloud providers offer both the resources and the management tools to let you use varying amounts of compute, network, ... resources from hour to hour or day to day.
So elasticity (in my mind) solves the business need to react and adapt to changing demand that might follow a pattern like day/night or season/off-season, but might be relatively stable from year to year or even week to week.
Scalability, in my mind, is above all the ability of these "hyperscalers" to let customers grow their systems continuously and almost without an upper limit. So the average (e.g. weekly) usage can go up week after week for months on end, and you wouldn't run out of upgrade options with the cloud providers to help you serve more and more requests.
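A toy sketch of the distinction, with every number invented: elasticity is the loop that adds and removes instances as demand swings during the day, while scalability is the headroom that lets the same rule keep working as average demand grows month after month.

# Toy illustration only; the thresholds and capacities are invented.
import math

def desired_instances(requests_per_sec: float,
                      per_instance_capacity: float = 100.0,
                      max_instances: int = 1000) -> int:
    """How many instances an elastic autoscale rule might ask for."""
    needed = max(1, math.ceil(requests_per_sec / per_instance_capacity))
    return min(needed, max_instances)

# Elasticity: within the same week, demand swings with the time of day.
print(desired_instances(50))     # overnight trough -> 1 instance
print(desired_instances(2500))   # daytime peak     -> 25 instances

# Scalability: a year later average demand is 20x higher, and the same
# rule still has plenty of headroom before hitting any platform limit.
print(desired_instances(50000))  # sustained growth -> 500 instances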
Compute and Networking
The cloud uses "software-defined networking", which abstracts all that hardware stuff like switches and routers away from you as a user and offers connectivity options that would be hard to realise on your own with traditional networking. So the networking capabilities of a major cloud provider are a feature set of their own, with lots of room for system improvements and capabilities. Therefore networking is designed, serviced, billed, ... separately from other service classes like compute or storage.
A simple example or illustration of that might be a virtual machine (or several) that, on their own as stand-alone compute resources, might have a network interface and a public IP attached to them. You can reach that machine, the machine can reach the internet (if you configure it that way), and you can install stuff on it. That's it: you have compute power.
But when you group virtual machines into, for example, application security groups, and use those groups as objects in resources that allow, deny, or redirect traffic internally or externally, and maybe tunnel traffic from these compute resources to your on-premises resources (in many cases Active Directory Domain Services), you start to use advanced networking capabilities. But obviously there's much more, and networking can be one of the hardest parts of certification exams on cloud topics.

IdentityServer4 high availability?

Is there a recommended way to configure IdentityServer for high availability? What are the pros and cons of one solution over another?
Currently I use ARR for it, but I have some issues and I'm not sure whether it is the best solution anyway.
It's not really a question specific to ASP.NET Core or IDS4, and doing it to a degree where it really is truly HA is quite hard.
That said, if not talking about Amazon/Azure stuff... I'd do something like this:
Two sets of servers, each in geographically separate sites, combined with a multi-site SQL Server AlwaysOn Availability Group using synchronous commits and automatic failover, and the app configured to suit (i.e. MultiSubnetFailover=true).
In front of each set of web servers, have a traffic manager (with its own HA) with health checking enabled, and then in front of that have active DNS failover via a service like Dyn (https://dyn.com/active-failover/).
This would allow you to lose an entire data centre, or any individual server, and carry on like nothing happened.
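As a rough illustration of the health-checking layer in that design, this is the kind of probe a traffic manager or DNS failover service runs against each site. The URLs are placeholders, and in practice Traffic Manager / Dyn implement this for you:

# Sketch of the logic behind "health checking enabled": poll a health
# endpoint at each site and fail over to the first healthy one.
import urllib.error
import urllib.request

SITES = [
    "https://ids-site-a.example.com/health",  # primary site (placeholder)
    "https://ids-site-b.example.com/health",  # second site (placeholder)
]

def is_healthy(url: str, timeout: float = 3.0) -> bool:
    """A site counts as healthy if its health endpoint returns HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def pick_active_site():
    """Return the first healthy site, in priority order, or None."""
    for url in SITES:
        if is_healthy(url):
            return url
    return None

print("active site:", pick_active_site())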

What is the best Azure product for routing traffic between autonomous locations?

We have a website which provides services for people based in a particular city.
We want to scale and provide it for more cities, but we want to keep the IT separate per city: one web host, cloud service, database, etc. per location. This not only enables us to scale each city individually (some cities are several times bigger than others), but most significantly it keeps our code base and DB queries simpler, because they no longer need city predicates, even though it is more expensive in general.
At the same time, we do not want to use subdomains. The user can switch city through a dropdown, and the request should go to the appropriate VM without the URL changing, so the routing should work seamlessly.
Based on the Azure documentation, we are still not sure which solution would meet our needs: Traffic Manager, Load Balancer, or custom redirects.
How you accomplish this is ultimately up to you, but from an Azure-specific perspective, the only multi-region built-in load-balancing service is Traffic Manager. This operates in one of three routing modes:
Primary/failover
Round-robin
Closest (based on latency, not physical distance)
For any other type of routing (such as letting the user choose a location, per your question), you'd need to implement it on your own or via a third-party service (and how to accomplish that would be a matter of opinion/debate/discussion, which is off-topic for Stack Overflow).
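For illustration, a self-managed routing tier for the "user picks a city" case could be as simple as a thin proxy in front of the per-city deployments. This stdlib sketch routes on a hypothetical city cookie, and the backend hostnames are placeholders:

# Minimal sketch of a self-managed routing tier: one public entry point
# that proxies each GET to the backend of the city the user picked, so
# the public URL never changes.
from http.server import BaseHTTPRequestHandler, HTTPServer
from http.cookies import SimpleCookie
import urllib.request

CITY_BACKENDS = {                      # placeholder per-city deployments
    "city-a": "https://city-a-app.example.com",
    "city-b": "https://city-b-app.example.com",
}
DEFAULT_CITY = "city-a"

class CityRouter(BaseHTTPRequestHandler):
    def do_GET(self):
        # The dropdown selection is assumed to be stored in a "city" cookie.
        cookies = SimpleCookie(self.headers.get("Cookie", ""))
        city = cookies["city"].value if "city" in cookies else DEFAULT_CITY
        backend = CITY_BACKENDS.get(city, CITY_BACKENDS[DEFAULT_CITY])
        try:
            # Forward the same path to the chosen city's deployment.
            with urllib.request.urlopen(backend + self.path) as upstream:
                body = upstream.read()
            self.send_response(200)
            self.end_headers()
            self.wfile.write(body)
        except OSError:
            self.send_error(502, "city backend unreachable")

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), CityRouter).serve_forever()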
Since you're looking to have a separate DB, cloud role, and web host per city, I do not see how you can get away from using subdomains.
Do you not want subdomains because of SEO? If so, it'd be easier to find another way to solve the SEO problem.
But whichever Traffic Manager or other DNS-based routing solution you use, it'll be splitting users by where they come FROM, not by where they're going TO.
The destination problem is solved through separate subdomains.

How to deal with Azure outages; the current one was a network drop between Websites and SQL Database

We just suffered a SQL Database connectivity issue on Azure. Although very quick, around one minute, it kicked all users out and/or raised Elmah errors such as:
The wait operation timed out ...
at System.Data.ProviderBase.DbConnectionPool.TryGetConnection
Even glitches like this compromise confidence. I am trying to understand good approaches to dealing with these transitory outages. Some thoughts that come to mind:
a) Have some code that checks that all required services are running before using them, and keep checking, with a friendly error message, until they are. I think there is a tendency to assume everything is available and working, and I wonder whether this is a dangerous assumption in the world of cloud. I suppose this is more an approach one would take when building a distributed application, although perhaps not for a database, which is usually close to the web application.
b) Use failover procedures such as Traffic Manager. However, it is expensive, as one now has more than one instance, and one also needs to take care of syncing data across more than one DB, etc. Associated link: Failover procedure in Azure.
c) Make sure custom error pages are used so the Yellow Screen of Death (YSOD) is not seen:
<customErrors mode="RemoteOnly" defaultRedirect="~/Error/Error" />
Although a YSOD was seen by a colleague; not sure how, with the above in force. One criticism I have of Azure is that if Websites are down, one can get bad error pages, provided only by Azure and not customisable, although I was advised that using something like CloudFlare can sort this issue.
I think a) is the most interesting concept. Should we code Azure Web Apps as if they were WAN rather than LAN applications, assume nodes could be down, and check beforehand?
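On a), the common alternative to checking everything up front is to assume any call can fail briefly and wrap database work in retry-with-backoff for transient errors, falling back to a friendly page only when the retries are exhausted. A generic sketch (the transient-error classification is deliberately simplified; real code would match specific SQL exception types and error numbers):

# Generic retry-with-backoff sketch for transient faults such as the
# connection-pool timeout above.
import random
import time

class TransientError(Exception):
    """Stand-in for 'the database briefly went away'."""

def with_retries(operation, attempts=4, base_delay=0.5):
    """Run operation(), retrying transient failures with jittered backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == attempts:
                raise  # out of retries: surface a friendly error page instead
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.2)
            time.sleep(delay)

def load_dashboard():
    # ... open connection, run query ...
    raise TransientError("The wait operation timed out")

try:
    with_retries(load_dashboard)
except TransientError:
    print("Show the friendly 'please try again shortly' page")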
I would really appreciate thoughts on the above. Our feeling is that Azure is having a few too many of these outage blips now, which may be due to increased customer numbers... not sure. Although no doubt still within the 99.9% annual SLA.
EDIT1
A useful MSDN Azure Cloud Architecture article on this:
Resilient Azure Website Architectures

Cross-colo fail-over design, DNS level fail-over?

I'm interested in cross-colo fail-over strategies for web applications, such that if the main site fails users seamlessly land at the fail-over site in another colo.
The application side of things looks to be mostly figured out with a master-slave database setup between the colos and services designed to recover and be able to pick up mid-stream. I'm trying to figure out the strategy for moving traffic from the main site to the fail-over site. DNS failover, even with low TTLs, seems to carry a fair bit of latency.
What strategies would you recommend for quickly moving traffic between colos, assuming the servers at the main colo are unreachable?
If you have other interesting experience / words of wisdom about cross-colo failover I'd love to hear those as well.
DNS based mechanisms are troublesome, even if you put low TTLs in your zone files.
The reason for this is that many applications (e.g. MSIE) maintain their own caches which ignore the TTL. Other software will do a single gethostbyname() or equivalent call and store the result until the program is restarted.
Worse still, many ISPs' recursive DNS servers are known to ignore TTLs below their own preferred minimum and impose their own higher TTLs.
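To make the application-cache point concrete, here is a tiny sketch of an application-level DNS cache defeating a low TTL: the process resolves once and keeps connecting to the old address no matter what the zone now says (the hostname is a placeholder).

# Illustration of application-level DNS caching ignoring a low TTL.
import socket
import time

_dns_cache = {}

def resolve_cached(host: str) -> str:
    """Resolve once and keep the answer for the life of the process."""
    if host not in _dns_cache:
        _dns_cache[host] = socket.gethostbyname(host)
    return _dns_cache[host]

ip_before = resolve_cached("www.example.com")
time.sleep(5)                        # imagine the zone fails over here
ip_after = resolve_cached("www.example.com")
assert ip_before == ip_after         # still the old colo's address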
Ultimately if the site is to run from both data centers without changing its IP address then you need to look at arrangements for "Multihoming" via global BGP4 route announcements.
With multihoming you need to get at least a /24 netblock of "provider independent" (aka "PI") IP address space, and then have it announced to the global routing table from the backup site only if the main site goes offline.
As for DNS, I like to reference, "Why DNS Based Global Server Load Balancing Doesn't Work". For everything else -- use BGP.
Designing networks to load balance using BGP is still not an easy task, and I myself am certainly not an expert on it. It's also more complex than Wikipedia can tell you, but there are a couple of interesting articles on the web that detail how it can be done:
Load Balancing In BGP Networks
Load Sharing in Single and Multi homed environments
There is always more if you search for BGP and load balancing. There are also a couple of whitepapers on the net which describe how Akamai does its global load balancing (I believe it's BGP too), which is always interesting to read and learn about.
Beyond the obvious concepts you can implement with software and hardware, you might also want to check with your ISP/provider/colo whether they can set you up.
Also, no offense intended regarding your choice of colo (who's the provider?), but most places should be set up to deal with downtime and so on; they should not require you to take action. Of course floods or aliens can always strike, but in that case I guess there are more important issues. :-)
If you can: Multicast (http://en.wikipedia.org/wiki/Multicast) or Anycast (http://en.wikipedia.org/wiki/Anycast).
