Azure Traffic Manager make sure no traffic is flowing after disabling endpoint - azure

I am trying to find a powershell command which helps find out a way that there is no open connections or any traffic is flowing to endpoint1 or confirm traffic is moving smoothly to endpoint2 after disabling endpoint1:
$e[0].EndpointStatus = "Disabled"
Set-AzureRmTrafficManagerEndpoint -TrafficManagerEndpoint $e
Is there a command to do this? I am not able to find anything in google or should I use some wait command to wait for like a minute to flush out all open connections?
*Basically looking for a way to make sure all in-flight connections are drained from one endpoint before disabling it.

Traffic does not flow through your Traffic Manager instance. Therefore, the functionality you are asking for from Traffic Manager does not exist. Traffic Manager simply resolves DNS queries to an IP address of one of your endpoints using the routing method (priority, weighted, performance, etc) you configured it for.
After disabling an endpoint, you could still see traffic going to the disabled endpoint for a period of time measured by your Traffic Manager profile DNS TTL setting. For example, if you disable an endpoint at 3:01:00 and your DNS TTL setting is 90 seconds, then you could see traffic until 3:02:30 because that's how long it could take for any client's DNS cache to expire. One way to monitor this is through the Queries by Endpoint Returned metric described here. This should work in most cases. However, it's not 100%. Just because you disabled an endpoint in Traffic Manager won't stop a client that know's the IP address of your endpoint from calling it. You can decide whether or not this scenario is likely for your application and clients. So, to be absolutely certain there are no active clients using the endpoint, you will need some monitoring in place at the endpoint.
Finally, if you gracefully stop your web app, virtual machine, or other service hosting the endpoint you want disabled, then any active requests to your application will complete before the service shuts down, assuming your application completes requests in a reasonable time (a few seconds).
Documentation on how to test and verify your Traffic Manager settings is available here.

Related

Azure Load Balancing Solutions. Direct Traffic to Specific VMs

We are having difficulties choosing a load balancing solution (Load Balancer, Application Gateway, Traffic Manager, Front Door) for IIS websites on Azure VMs. The simple use case when there are 2 identical sites is covered well – just use Azure Load Balancer or Application Gateway. However, in cases when we would like to update websites and test those updates, we encounter limitation of load balancing solutions.
For example, if we would like to update IIS websites on VM1 and test those updates, the strategy would be:
Point a load balancer to VM2.
Update IIS website on VM1
Test the changes
If all tests are passed then point the load balancer to VM1 only, while we update VM2.
Point the load balancer to both VMs
We would like to know what is the best solution for directing traffic to only one VM. So far, we only see one option – removing a VM from backend address pool then returning it back and repeating the process for other VMs. Surely, there must be a better way to direct 100% of traffic to only one (or to specific VMs), right?
Update:
We ended up blocking the connection between VMs and Load Balancer by creating Network Security Group rule with Deny action on Service Tag Load Balancer. Once we want that particular VM to be accessible again we switch the NSG rule from Deny to Allow.
The downside of this approach is that it takes 1-3 minutes for the changes to take an effect. Continuous Delivery with Azure Load Balancer
If anybody can think of a faster (or instantaneous) solution for this, please let me know.
Without any Azure specifics, the usual pattern is to point a load balancer to a /status endpoint of your process, and to design the endpoint behavior according to your needs, eg:
When a service is first deployed its status is 'pending"
When you deem it healthy, eg all tests pass, do a POST /status to update it
The service then returns status 'ok'
Meanwhile the load balancer polls the /status endpoint every minute and knows to mark down / exclude forwarding for any servers not in the 'ok' state.
Some load balancers / gateways may work best with HTTP status codes whereas others may be able to read response text from the status endpoint. Pretty much all of them will support this general behavior though - you should not need an expensive solution.
We ended up blocking connection between VMs and Load Balancer by creating Network Security Group rule with Deny action on Service Tag Load Balancer. Once we want that particular VM to be accessible again we switch the NSG rule from Deny to Allow.
The downside of this approach is that it takes 1-3 minutes for the changes to take an effect. Continuous Delivery with Azure Load Balancer
If anybody can think of a faster (or instantaneous) solution for this, please let me know.
I had exactly the same requirement in an Azure environment which I built a few years ago. Azure Front Door didn't exist, and I had looked into using the Azure API to automate the process of adding and removing backend servers the way you described. It worked sometimes, but I found the Azure API was unreliable (lots of 503s reconfiguring the load balancer) and very slow to divert traffic to/from servers as I added or removed them from my cluster.
The solution that follows probably won't be well received if you are looking for an answer which purely relies upon Azure resources, but this is what I devised:
I configured an Azure load balancer with the simplest possible HTTP and HTTPS round-robin load balancing of requests on my external IP to two small Azure VMs running Debian with HAProxy. I then configured each HAProxy VM with backends for the actual IIS servers. I configured the two HAProxy VMs in an availability set such that Microsoft should not ever reboot them simultaneously for maintenance.
HAProxy is an excellent and very robust load balancer, and it supports nearly every imaginable load balancing scenario, and crucially for your question, it also supports listening on a socket to control the status of the backends. I configured the following in the global section of my haproxy.cfg:
global
log /dev/log local0
log /dev/log local1 notice
chroot /var/lib/haproxy
stats socket /run/haproxy/admin.sock mode 660 level admin
stats socket ipv4#192.168.95.100:9001 level admin
stats timeout 30s
user haproxy
group haproxy
daemon
In my example, 192.168.95.100 is the first HAProxy VM, and 192.168.95.101 is the second. On the second server, these lines would be identical except for its internal IP.
Let's say you have an HAProxy frontend and backend for your HTTPS traffic to two web servers, ws1pro and ws2pro with the IPs 192.168.95.10 and 192.168.95.11 respectively. For simplicity sake, I'll assume we don't need to worry about HTTP session state differences across the two servers (e.g. Out-of-Process session state) so we just divert HTTPS connections to one node or the other:
listen stats
bind *:8080
mode http
stats enable
stats refresh 10s
stats show-desc Load Balancer
stats show-legends
stats uri /
frontend www_https
bind *:443
mode tcp
option tcplog
default_backend backend_https
backend backend_https
mode tcp
balance roundrobin
server ws1pro 192.168.95.10:443 check inter 5s
server ws2pro 192.168.95.11:443 check inter 5s
With the configuration above, since both HAProxy VMs are listening for admin commands on port 9001, and the Azure load balancer is sending the client's requests to either VM, we need to tell both servers to disable the same backend.
I used Socat to send the cluster control commands. You could do this from a Linux VM, but there is also a Windows version of Socat, and I used the Windows version in a set of really simple batch files. The cluster control commands would actually be the same in BASH.
stop_ws1pro.bat:
echo disable server backend_https/ws1pro | socat - TCP4:192.168.95.100:9001
echo disable server backend_https/ws1pro | socat - TCP4:192.168.95.101:9001
start_ws1pro.bat:
echo enable server backend_https/ws1pro | socat - TCP4:192.168.95.100:9001
echo enable server backend_https/ws1pro | socat - TCP4:192.168.95.101:9001
These admin commands execute almost instantly. Since the HAProxy configuration above enables the stats page, you should be able to watch the status change happen on the stats page as soon as it refreshes. The stats page will show the connections or sessions draining from the server you disabled over to the remaining enabled servers when you disable a backend, and then show them returning to the server once it is enabled again.

Browser failing to renegotiate DNS on persistent connection

I’m investigating a scenario with a live dashboard (Angular web app) that is refreshed every 5 seconds (polling). The API is sitting behind Azure Traffic Manager which will fail over to a second region in the event of a failure in the primary region. Keep in mind, Azure Traffic Manager works at the DNS level.
The problem I am facing is that the browser maintains a persistent connection to the primary region even after the Traffic Manager has failed over. The requests initially fail with 503s, but then continue to fail with 502s. The DNS lookup is never performed again as the requests occur more frequently than the keep-alive timeout. This causes the browser to continue to make requests to the failed region.
Is there anyway to explicitly kill the connection to force a DNS lookup? The only way I’ve found so far is to stop making requests for 2 minutes, or to close and reopen the browser. Neither is an acceptable solution for a dashboard that is supposed to be hands off and always fresh.
What’s interesting is after getting the browser to fail over to the secondary region, if I restart the primary region the browser will automatically switch back to the primary region after about a minute. This tells me the connection is respecting the DNS TTL when the service is functioning properly, but not when the server is unavailable. This makes no sense to me why the browser would lock onto a single IP forever when it’s not found.
Is there something I am missing about implementing georedundant failover with Traffic Manager for a web application? It seems very odd to me that the user would have to stop making requests for 2 minutes in any scenario before the browser would renegotiate the IP to the failed over server. Is it expected to turn of keep-alive to truly support near instant failover?
Here's a diagram that describes this scenario:
Diagram
Generally, Azure Traffic Manager works at the DNS level. Clients connect to the service endpoint directly, not through Traffic Manager. Traffic Manager has no way to track individual clients and cannot implement 'sticky' sessions.
For initial DNS lookup performance impact, you could find the explanation details here1 and here2
DNS name resolution is fast and results are cached. The speed of the
initial DNS lookup depends on the DNS servers the client uses for name
resolution. Typically, a client can complete a DNS lookup within ~50
ms. The results of the lookup are cached for the duration of the DNS
Time-to-live (TTL). The default TTL for Traffic Manager is 300
seconds.
The TTL value of each DNS record determines the duration of
the cache. Shorter values result in faster cache expiry and Longer
values mean that it can take longer to direct traffic away from a
failed endpoint. Traffic Manager allows you to configure the TTL as
low as 0 seconds and as high as 2,147,483,647 seconds. You could
choose the value that best balances the needs of your application.
Like the above, if you want the DNS lookup faster, you could set the TTL value as low as possible. Once the connection set up, the clients persistently connect to the selected endpoint until the endpoint is unhealthy via the health check.
You can enable and disable Traffic Manager profiles and endpoints. However, a change in endpoint status also might occur as a result of Traffic Manager automated settings and processes.. Get more details here.
For Geographic routing method,
The endpoint mapped to serve the geographic location based on the
query request IP’s is returned. If that endpoint is unavailable,
another endpoint will not be selected to failover to, since a
geographic location can be mapped only to one endpoint in a profile
(more details are in the FAQ). As a best practice, when using
geographic routing, we recommend customers to use nested Traffic
Manager profiles with more than one endpoint as the endpoints of the
profile.

Is Azure Traffic manager is reliable for failover? what are other problems I should be worried about?

I am planning to use Azure Traffic manager to do a failover of my app running on one Azure zone to Azure zone.
I need some suggestion, if that is the correct approach to do a failover ? We have seen issue with Azure that, most of the services in one region goes down for few hours. Although I understand that Azure traffic manager is not associated with the region. But is it possible that Azure traffic manager goes down or that traffic manager endpoint is not reachable although my backend webapp is reachable?
If I am planning to use Azure traffic manager, what are other problems I should be worried about ?
I've been working with TM for some time now, so here are a few issues I haven't seen mentioned before:
Keep-Alive
If your service allows Keep-Alive, then your DNS entry will be ignored as long as the connection remains open. I've seen some exceptionally odd behavior result from this, including users being stuck on a fallback page since they kept using the connection, causing it to remain open indefinitely. If you have access to IIS Manager, you can force Keep-Alive to be false.
Browser DNS Caching
Most browsers have their own DNS cache, and very few honor DNS Time To Live. In my experience Chrome is pretty responsive, with IE and Edge having significant delays if you need them to rollover quickly. I've heard that Opera is particularly bad.
Other DNS Caching
Even if you're not accessing your service through a browser, other components can have DNS caches, and some of them will allow you to manage the cache yourself. This can in theory even depend on ISP's DNS caching, though reports on the magnitude of this vary significantly.
Traffic Manager works at the DNS level, which itself is replicated. However, even then, you should still build in redundancy into your solution.
Take a look at the Azure Architecture Center under "Make all things redundant" and you will see a recommendation for Traffic Manager:
consider adding another traffic management solution as a failback. If
the Azure Traffic Manager service fails, change your CNAME records in
DNS to point to the other traffic management service.
The Traffic Manager internal architecture is resilient to the failure of any single Azure region. So, even if a region fails, Traffic Manager should stay up. That applies to all Traffic Manager components: control plane, endpoint monitoring, and DNS name servers.
Since Traffic Manager works at the DNS level, it doesn't have an 'endpoint' that proxies your traffic--it uses DNS to direct clients to the appropriate endpoint, and clients then connect to those endpoints directly. Thus, an unreachable endpoint is an application problem, not a Traffic Manager problem.
That said, if the Traffic Manager DNS name servers are down, you have a serious problem. You DNS resolution path will fail and your customers will be impacted. The only solution is to either accept the risk (small, but can never be zero) or have a plan in place to use another DNS system, either in parallel or failover. This is not a limitation of Traffic Manager; you could say the same about any DNS-based traffic management system.
The earlier answer from DornaDigital is very good (other than the first point which suggests DNS caching will protect you through a name server outage--it won't). It covers some important points. In short, DNS-based failover works well for new sessions. Existing clients may have to refresh or even close their browser and reconnect.
I also agree with the details provided dornadigital.
There are considerations for front end applications as well. The browsers all have different thresholds for how long they maintain persistent connections. Chromium, for example, currently maintains a connection unless there is inactivity for 300 seconds.
In our web applications, we are detecting the failover by the presence of a certain number of failed requests to the endpoint. After requests begin failing, we pause requests for 301 seconds to allow the connection to reset. This allows the DNS change from the traffic manager to be applied to subsequent requests. We pop up a snackbar to indicate to the user that we are having an issue and display the count down when requests will resume. Similar to Gmail when it has an issue connecting to their servers.
I hope that gives you one idea on how to build some redundancy into your web apps.
I disagree with Jonathan as his understanding of the resiliency of the Traffic Manager service is in disagreement with Microsoft's own documentation on the subject.
When you provision Azure Traffic Manager, you select a region in which to deploy the service. I (correctly) inferred this to assert if said region were to fail, the Traffic Manager service could also be impacted and in turn, your application solution would not properly fail over to the secondary region.
According to Microsoft's Azure Application Architecture Guide, under "Make all things redundant", a customer should deploy Traffic Manager into more than one region:
Include redundancy for Traffic Manager. Traffic Manager is a possible failure point. Review the Traffic Manager SLA, and determine whether using Traffic Manager alone meets your business requirements for high availability. If not, consider adding another traffic management solution as a failback. If the Azure Traffic Manager service fails, change your CNAME records in DNS to point to the other traffic management service.
Azure Application Architecture Guide - Make all things redundant
My thought and intention is to not deploy Traffic Manager within the primary service region, but instead to deploy it into the secondary (failover region) and a tertiary (3rd) region as a backup.

Azure Website doesnt detect Traffic Manager Change

I have an Azure website (website.mycompany.com) that uses a WCF service for some data. The WCF Service sits behind an Azure Traffic Manager (service.mycompany.com) running in "priority mode", with 2 instances of the service for failover handling. With priority mode, the primary always serves up the data first, unless it's unavailable. If unavailable, the 2nd instance will reply.. and so on down the line.
We've had a few instances recently where the primary endpoint for service.mycompany.com was offline. For "partnerships" who point to service.mycompany.com, they detected the switch and all was fine. Lately however, our own site (website.mycompany.com) does NOT detect the traffic manager switch, and the website has errors since the service fails to reply.
Our failover endpoint in these instances is up, and in the past the Azure website detected the switch, it's only recently we've encountered this issue. Has anyone experienced similar issues? Are there perhaps any DNS changes that we need to tweak in our Azure Website to help it detect TTL's?
Has anyone experienced similar issues?
Do you mean the traffic manager can't switch to another endpoint immediately?
Traffic manager works at the DNS level, here are the reasons why traffic manager can't switch immediately:
The duration of the cache is determined by the 'time-to-live' (TTL) property of each DNS record. Shorter values result in faster cache expiry and thus more round-trips to the Traffic Manager name servers. Longer values mean that it can take longer to direct traffic away from a failed endpoint.
The traffic manager endpoint monitor effects the response time. More information about how azure traffic manager works, please refer to the link.
The following timeline is a detailed description of the monitoring process.
Also we can check traffic manager profile using nslookup and ipconfig in windows. About how to vertify traffic Manager settings, please refer to the link.
By the way, because traffic manager works at the DNS level, it cannot influence existing connections to any endpoint. When it directs traffic between endpoints (either by changed profile settings, or during failover or failback), Traffic Manager directs new connections to available endpoints. However, other endpoints might continue to receive traffic via existing connections until those sessions are terminated. To enable traffic to drain from existing connections, applications should limit the session duration used with each endpoint.
I'm going to refer you to my answer here because while the situation isn't exactly the same, it seems like it could have the same solution. To summarize, I find it likely that you have a connection left open to the down service that isn't being properly closed. This connection is independent of TTL, which only deals with DNS caching, and as such bypasses Traffic Manager completely.

Do I need a Site Down page for Azure Traffic Manager in Priority Mode

I'm setting up an Azure Traffic Manager, in Priority Mode, for my website. I have a primary and a failover location, both being monitoring by a "FailoverMonitor.aspx" page - if any resources are down for the appropriate resource\region, I return a 500 error. I wanted to also make sure an error message was returned to the user if all locations were down.
In my testing, I decided to break both my primary (priority 1) and failover (priority 2), and in doing so, I saw that the primary location was served up.
This kind of surprised me, I half expected the site to not return anything at all.. but instead it served up a site that is considered to be in a "degraded" status.
I added a 3rd endpoint to the traffic manager that returns a "sorry we're down" page - but is this the intended methodology to return such a message? I just want to make sure I'm going through all intended steps and not misusing the service. Thanks!
When all the endpoints being monitored by Traffic Manager for a given profile are down, it makes a "best case effort" and responds as if all the endpoints are actually in an online state, instead of not returning any endpoint at all.
More details of this and other endpoint monitoring details can be found at: https://azure.microsoft.com/en-us/documentation/articles/traffic-manager-monitoring/
Relevant section copy pasted below:
What happens if all Traffic Manager endpoints (excluding endpoints with a Disabled or Stopped status) are failing their health checks, and show a Degraded status?
This most commonly is caused by an error in the configuration of the service (such as an access control list [ACL] blocking the Traffic Manager health checks), or an error in the configuration of the Traffic Manager profile (such as an incorrect monitoring path).
In this case, Traffic Manager makes a "best effort" attempt and responds as if all the Degraded status endpoints actually are in an online state. This is preferable to the alternative, which would be to not return any endpoint in the DNS response.
A consequence of this behavior is that if Traffic Manager health checks are not configured correctly, it might appear from the traffic routing as though Traffic Manager is working properly. However, in this case, endpoint failover will not happen if an endpoint fails, and this affects overall application availability. To ensure that this does not occur, it is important to check that the profile shows an Online status, and not a Degraded status. An Online status shows that the Traffic Manager health checks are working as expected.
I wanted to also make sure an error message was returned to the user if all locations were down.
Since Traffic Manager is a DNS only solution i'm not sure who's supposed to serve the "We're down" page..
A 3rd endpoint serving a static page should do the job.

Resources