We are having difficulties choosing a load balancing solution (Load Balancer, Application Gateway, Traffic Manager, Front Door) for IIS websites on Azure VMs. The simple use case when there are 2 identical sites is covered well – just use Azure Load Balancer or Application Gateway. However, in cases when we would like to update websites and test those updates, we encounter limitation of load balancing solutions.
For example, if we would like to update IIS websites on VM1 and test those updates, the strategy would be:
Point a load balancer to VM2.
Update IIS website on VM1
Test the changes
If all tests are passed then point the load balancer to VM1 only, while we update VM2.
Point the load balancer to both VMs
We would like to know what is the best solution for directing traffic to only one VM. So far, we only see one option – removing a VM from backend address pool then returning it back and repeating the process for other VMs. Surely, there must be a better way to direct 100% of traffic to only one (or to specific VMs), right?
Update:
We ended up blocking the connection between VMs and Load Balancer by creating Network Security Group rule with Deny action on Service Tag Load Balancer. Once we want that particular VM to be accessible again we switch the NSG rule from Deny to Allow.
The downside of this approach is that it takes 1-3 minutes for the changes to take an effect. Continuous Delivery with Azure Load Balancer
If anybody can think of a faster (or instantaneous) solution for this, please let me know.
Without any Azure specifics, the usual pattern is to point a load balancer to a /status endpoint of your process, and to design the endpoint behavior according to your needs, eg:
When a service is first deployed its status is 'pending"
When you deem it healthy, eg all tests pass, do a POST /status to update it
The service then returns status 'ok'
Meanwhile the load balancer polls the /status endpoint every minute and knows to mark down / exclude forwarding for any servers not in the 'ok' state.
Some load balancers / gateways may work best with HTTP status codes whereas others may be able to read response text from the status endpoint. Pretty much all of them will support this general behavior though - you should not need an expensive solution.
We ended up blocking connection between VMs and Load Balancer by creating Network Security Group rule with Deny action on Service Tag Load Balancer. Once we want that particular VM to be accessible again we switch the NSG rule from Deny to Allow.
The downside of this approach is that it takes 1-3 minutes for the changes to take an effect. Continuous Delivery with Azure Load Balancer
If anybody can think of a faster (or instantaneous) solution for this, please let me know.
I had exactly the same requirement in an Azure environment which I built a few years ago. Azure Front Door didn't exist, and I had looked into using the Azure API to automate the process of adding and removing backend servers the way you described. It worked sometimes, but I found the Azure API was unreliable (lots of 503s reconfiguring the load balancer) and very slow to divert traffic to/from servers as I added or removed them from my cluster.
The solution that follows probably won't be well received if you are looking for an answer which purely relies upon Azure resources, but this is what I devised:
I configured an Azure load balancer with the simplest possible HTTP and HTTPS round-robin load balancing of requests on my external IP to two small Azure VMs running Debian with HAProxy. I then configured each HAProxy VM with backends for the actual IIS servers. I configured the two HAProxy VMs in an availability set such that Microsoft should not ever reboot them simultaneously for maintenance.
HAProxy is an excellent and very robust load balancer, and it supports nearly every imaginable load balancing scenario, and crucially for your question, it also supports listening on a socket to control the status of the backends. I configured the following in the global section of my haproxy.cfg:
global
log /dev/log local0
log /dev/log local1 notice
chroot /var/lib/haproxy
stats socket /run/haproxy/admin.sock mode 660 level admin
stats socket ipv4#192.168.95.100:9001 level admin
stats timeout 30s
user haproxy
group haproxy
daemon
In my example, 192.168.95.100 is the first HAProxy VM, and 192.168.95.101 is the second. On the second server, these lines would be identical except for its internal IP.
Let's say you have an HAProxy frontend and backend for your HTTPS traffic to two web servers, ws1pro and ws2pro with the IPs 192.168.95.10 and 192.168.95.11 respectively. For simplicity sake, I'll assume we don't need to worry about HTTP session state differences across the two servers (e.g. Out-of-Process session state) so we just divert HTTPS connections to one node or the other:
listen stats
bind *:8080
mode http
stats enable
stats refresh 10s
stats show-desc Load Balancer
stats show-legends
stats uri /
frontend www_https
bind *:443
mode tcp
option tcplog
default_backend backend_https
backend backend_https
mode tcp
balance roundrobin
server ws1pro 192.168.95.10:443 check inter 5s
server ws2pro 192.168.95.11:443 check inter 5s
With the configuration above, since both HAProxy VMs are listening for admin commands on port 9001, and the Azure load balancer is sending the client's requests to either VM, we need to tell both servers to disable the same backend.
I used Socat to send the cluster control commands. You could do this from a Linux VM, but there is also a Windows version of Socat, and I used the Windows version in a set of really simple batch files. The cluster control commands would actually be the same in BASH.
stop_ws1pro.bat:
echo disable server backend_https/ws1pro | socat - TCP4:192.168.95.100:9001
echo disable server backend_https/ws1pro | socat - TCP4:192.168.95.101:9001
start_ws1pro.bat:
echo enable server backend_https/ws1pro | socat - TCP4:192.168.95.100:9001
echo enable server backend_https/ws1pro | socat - TCP4:192.168.95.101:9001
These admin commands execute almost instantly. Since the HAProxy configuration above enables the stats page, you should be able to watch the status change happen on the stats page as soon as it refreshes. The stats page will show the connections or sessions draining from the server you disabled over to the remaining enabled servers when you disable a backend, and then show them returning to the server once it is enabled again.
Related
I have 2 Azure VM sitting behind a Standard Azure Load Balancer.
The load balancer has a healthprobe pinging every 5 seconds with HTTP on /health for each VM.
Interval is set to 5, port is set to 80 and /health, and "unhealthy threshold" is set to 2.
During deployment of an application, we set the /health-endpoint to return 503 and then wait 35 seconds to allow the load balancer to mark the instance as down, and so stop sending new traffic.
However, Load balancer does not seem to fully take the VM out of load. It still sends traffic inbound to the down instance, causing downtime for our customers.
I can see in IIS-logs that the /health-endpoint is indeed returning 503 when it should.
Any ideas whats wrong? Can it be some sort of TCP keep-alive?
I got confirmation from microsoft that this is working "as intended", which makes the Azure Load Balancer a bad fit for web applications. This is the answer from Microsoft:
I was able to discuss your observation with the internal team.
They explained that the Load balancer does not currently have
“Connection Draining” feature and would not terminate existing
connections.
Connection Draining is available with the Application Gateway
Connection Draining.
I heard this is being planning for the Load balancer also as future
Road map . You could also add your voice to the request for this
feature for the Load balancer by filling the feedback Form.
Load Balancer is a pass through service which does not terminate existing TCP connections where the flow is always between the client and the VM's guest OS and application. If a backend endpoint's health probe fails, established TCP connections to this backend endpoint continue, but it will stop sending new flows to the respective unhealthy instance. This is by design to give you opportunity to gracefully shutdown from the application to avoid any unexpected and sudden termination of ongoing application workflow.
Also you may consider configuring TCP reset on idle https://learn.microsoft.com/en-us/azure/load-balancer/load-balancer-tcp-reset to reduce number of idle connections.
I would suggest you the following approach
You could have to place a healthcheck.html page on each of your VM's. As long as the probe can retrieve the page, the load balancer will keep sending user requests to the VM.
When you do the deployment, simply rename the healthcheck.html to a different name such as _healthcheck.html. This will cause the probe to start receiving HTTP 404 errors and will take that machine out of the load balanced rotation.
After your deployment have been completed, rename _healthcheck.html back to healthcheck.html. The Azure LB probe will start getting HTTP 200 responses and as a result start sending requests to this VM again.
Thanks,
Manu
Disclaimers: I come from AWS background but relatively very new to GCP. I know there are a number of existing similar questions (e.g, here and here etc) but I still cannot get it work since the exact/detailed instructions are still missing. So please bear with me to ask this again.
My simple design:
Public HTTP/S Traffic (Ingress) >> GCP Load Balancer >> GCP Servers
GCP Load Balancer holds the SSL Cert. And then it uses Port 80 for downstream connections to the Servers. Therefore, LB to the Servers are just HTTP.
My question:
How do I prevent the incoming HTTP/S Public Traffic from reaching to the GCP Servers directly? Instead, only allow the Load Balancer (as well as it's Healthcheck Traffic)?
What I tried so far:
I went into Firewall Rules and removed the previously allowing rule of Ports 80/443 (Ingress Traffic) from 0.0.0.0/0. And then, added (allowed) the External IP address of Load Balancer.
At this point, I simply expected the Public Traffic should be rejected but the Load Balancer's. But in reality, both seemed to be rejected. Nothing reached the Servers anymore. The Load Balancer's External IP wasn't seemed to be recognised.
Later I also noticed the "Healthchecks" were also not recognised anymore. Therefore Healthchecks couldn't reach to Servers and then failed. Hence the Instances were dropped by Load Balancer.
Please also note that: I cannot pursue the approach of simply removing the External IPs on the Servers. (Although many people say this would work.) But we still want to maintain the direct SSH accesses to the Servers (by not using a Bastion Instance). Therefore I still need the External IPs, on each and every Web Servers.
Any clear (and kind) instructions will be very much appreciated. Thank you all.
You're able to setup HTTPS connectivity between your load balancer and your back-end servers while using HTTP(S) load balancer. To achieve this goal you should install HTTPS certificates on your back-end servers and configure web-servers to use them. If you decided to completely switch to HTTPS and disable HTTP on your back-end servers you should switch your health check from HTTP to HTTPS also.
To make health check working again after removing default firewall rule that allow connection from 0.0.0.0/0 to ports 80 and 443 you need to whitelist subnets 35.191.0.0/16 and 130.211.0.0/22 which are source IP ranges for health checks. You can find step by step instructions how to do it in the documentation. After that, access to your web servers still be restricted but your load balancer will be able to use health check and serve your customers.
I'm trying to load balance a nodejs Server sent event backend and I need to know if there is a way to distribute the new connections to the instances with the least connected clients. The problem I have is when scaling up, the routing continues sending new connections to the already saturated instance and since the connections are long lived this simply won't work.
What options do I have for horizontal scaling long lived connections?
It looks like you want a Load Balancer that can provide both "sticky sessions" and use the "least connection" instead of "round-robin" policy. Unfortunately, NGINX cannot provide this.
HAProxy (High Availability Proxy) allows for this:
backend bk_myapp
cookie MyAPP insert indirect nocache
balance leastconn
server srv1 10.0.0.1:80 check cookie srv1
server srv2 10.0.0.2:80 check cookie srv2
If you need ELB functionality and want to roll it all manually, take a look at this guide.
You might also want to make sure classic AWS ELB "sticky session" configuration or the newer ALB "sticky session" option does not meet your needs. ELB normally sends connection to upstream server with the least "load", and when combining with sticky sessions might be enough.
Since you are using AWS, I'd recommend Elastic Beanstalk for your Node.js application deployment. The official documentation provides good examples, like this one. Note that Beanstalk will automatically create an Elastic Load Balancer for you, which is what you're looking for.
By default, Elastic Beanstalk creates an Application Load Balancer for
your environment when you enable load balancing with the Elastic
Beanstalk console or the EB CLI. It configures the load balancer to
listen for HTTP traffic on port 80 and forward this traffic to
instances on the same port.
[...]
Note:
Your environment must be in a VPC with subnets in at least two
Availability Zones to create an Application Load Balancer. All new AWS
accounts include default VPCs that meet this requirement. If your
environment is in a VPC with subnets in only one Availability Zone, it
defaults to a Classic Load Balancer. If you don't have any subnets,
you can't enable load balancing.
Note that the configuration of a proper health check path is key to properly balance requests, as you mentioned in your question.
In a load balanced environment, Elastic Load Balancing sends a request
to each instance in an environment every 10 seconds to confirm that
instances are healthy. By default, the load balancer is configured to
open a TCP connection on port 80. If the instance acknowledges the
connection, it is considered healthy.
You can choose to override this setting by specifying an existing
resource in your application. If you specify a path, such as /health,
the health check URL is set to HTTP:80/health. The health check URL
should be set to a path that is always served by your application. If
it is set to a static page that is served or cached by the web server
in front of your application, health checks will not reveal issues
with the application server or web container.
EDIT: If you're looking for sticky sessions, as I described in the comments, follow the steps provided in this guide:
To enable sticky sessions using the console
Open the Amazon EC2 console at https://console.aws.amazon.com/ec2/.
On the navigation pane, under LOAD BALANCING, choose Target Groups.
Select the target group.
On the Description tab, choose Edit attributes.
On the Edit attributes page, do the following:
a. Select Enable load balancer generated cookie stickiness.
b. For Stickiness duration, specify a value between 1 second and 7 days.
c. Choose Save.
I am trying to find a powershell command which helps find out a way that there is no open connections or any traffic is flowing to endpoint1 or confirm traffic is moving smoothly to endpoint2 after disabling endpoint1:
$e[0].EndpointStatus = "Disabled"
Set-AzureRmTrafficManagerEndpoint -TrafficManagerEndpoint $e
Is there a command to do this? I am not able to find anything in google or should I use some wait command to wait for like a minute to flush out all open connections?
*Basically looking for a way to make sure all in-flight connections are drained from one endpoint before disabling it.
Traffic does not flow through your Traffic Manager instance. Therefore, the functionality you are asking for from Traffic Manager does not exist. Traffic Manager simply resolves DNS queries to an IP address of one of your endpoints using the routing method (priority, weighted, performance, etc) you configured it for.
After disabling an endpoint, you could still see traffic going to the disabled endpoint for a period of time measured by your Traffic Manager profile DNS TTL setting. For example, if you disable an endpoint at 3:01:00 and your DNS TTL setting is 90 seconds, then you could see traffic until 3:02:30 because that's how long it could take for any client's DNS cache to expire. One way to monitor this is through the Queries by Endpoint Returned metric described here. This should work in most cases. However, it's not 100%. Just because you disabled an endpoint in Traffic Manager won't stop a client that know's the IP address of your endpoint from calling it. You can decide whether or not this scenario is likely for your application and clients. So, to be absolutely certain there are no active clients using the endpoint, you will need some monitoring in place at the endpoint.
Finally, if you gracefully stop your web app, virtual machine, or other service hosting the endpoint you want disabled, then any active requests to your application will complete before the service shuts down, assuming your application completes requests in a reasonable time (a few seconds).
Documentation on how to test and verify your Traffic Manager settings is available here.
I have an Azure website (website.mycompany.com) that uses a WCF service for some data. The WCF Service sits behind an Azure Traffic Manager (service.mycompany.com) running in "priority mode", with 2 instances of the service for failover handling. With priority mode, the primary always serves up the data first, unless it's unavailable. If unavailable, the 2nd instance will reply.. and so on down the line.
We've had a few instances recently where the primary endpoint for service.mycompany.com was offline. For "partnerships" who point to service.mycompany.com, they detected the switch and all was fine. Lately however, our own site (website.mycompany.com) does NOT detect the traffic manager switch, and the website has errors since the service fails to reply.
Our failover endpoint in these instances is up, and in the past the Azure website detected the switch, it's only recently we've encountered this issue. Has anyone experienced similar issues? Are there perhaps any DNS changes that we need to tweak in our Azure Website to help it detect TTL's?
Has anyone experienced similar issues?
Do you mean the traffic manager can't switch to another endpoint immediately?
Traffic manager works at the DNS level, here are the reasons why traffic manager can't switch immediately:
The duration of the cache is determined by the 'time-to-live' (TTL) property of each DNS record. Shorter values result in faster cache expiry and thus more round-trips to the Traffic Manager name servers. Longer values mean that it can take longer to direct traffic away from a failed endpoint.
The traffic manager endpoint monitor effects the response time. More information about how azure traffic manager works, please refer to the link.
The following timeline is a detailed description of the monitoring process.
Also we can check traffic manager profile using nslookup and ipconfig in windows. About how to vertify traffic Manager settings, please refer to the link.
By the way, because traffic manager works at the DNS level, it cannot influence existing connections to any endpoint. When it directs traffic between endpoints (either by changed profile settings, or during failover or failback), Traffic Manager directs new connections to available endpoints. However, other endpoints might continue to receive traffic via existing connections until those sessions are terminated. To enable traffic to drain from existing connections, applications should limit the session duration used with each endpoint.
I'm going to refer you to my answer here because while the situation isn't exactly the same, it seems like it could have the same solution. To summarize, I find it likely that you have a connection left open to the down service that isn't being properly closed. This connection is independent of TTL, which only deals with DNS caching, and as such bypasses Traffic Manager completely.