SignalR connections on Azure with load balancer

I have the following setup in Azure Resource Manager:
1 scale set with 2 virtual machines running Windows Server 2012.
1 Azure Redis cache (C1 Standard).
1 Azure load balancer (Layer 4 in the OSI network reference stack).
The load balancer is configured following:
https://learn.microsoft.com/en-us/azure/load-balancer/load-balancer-get-started-internet-portal
Session persistence is set to None for the rules.
On both VMs from the scale set I have deployed a test web app which uses SignalR 2 on .NET 4.5.2.
The test web app uses the Azure Redis cache as a backplane.
The web app project can be found here on GitHub: https://github.com/gaclaudiu/SignalrChat-master.
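For reference, the backplane is wired up in the OWIN Startup class roughly like this (a minimal sketch assuming the Microsoft.AspNet.SignalR.Redis package; the cache host, access key and event key are placeholders, and the actual repo may differ slightly):

using Microsoft.AspNet.SignalR;
using Owin;

public class Startup
{
    public void Configuration(IAppBuilder app)
    {
        // Point SignalR at the Azure Redis cache so both scale set VMs
        // exchange messages through the backplane. Values are placeholders.
        GlobalHost.DependencyResolver.UseRedis(
            "mycache.redis.cache.windows.net", 6379, "<access-key>", "SignalrChat");
        app.MapSignalR();
    }
}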
During the tests I noticed that after a SignalR connection is opened, all the data sent from the client in subsequent requests arrives at the same server from the scale set. It seems to me that the SignalR connection knows which server from the scale set to go to.
I am curious to know more about how this works. I tried to do some research on the internet but couldn't find anything clear on this point.
Also I am curious to know what happens in the following case:
Client 1 has an open SignalR connection to Server A.
The next request from client 1 through SignalR goes to Server B.
Will this cause an error?
Or will the client just be notified that no connection is open, and it will try to open a new one?

Well, I am surprised that it is working at all. The problem is that SignalR performs multiple requests until the connection is up and running, and there is no guarantee that all of those requests go to the same VM, especially if session persistence is not enabled. I had a similar problem. You can activate session persistence in the Load Balancer, but as you pointed out, acting on OSI layer 4 it will do this using the client IP (imagine everyone in the same office hitting your API from the same IP). In our project we use Azure Application Gateway, which works with cookie affinity -> OSI application layer (layer 7). So far it seems to work as expected.

I think you misunderstand how the load balancer works. Every TCP connection must send all of its packets to the same destination VM and port. A TCP connection would not work if, after sending several packets, the rest of the packets were suddenly sent to another VM and/or port. So the load balancer makes a decision on the destination for a TCP connection once, and only once, when that TCP connection is being established. Once the TCP connection is set up, all its packets are sent to the same destination IP/port for the duration of the connection. I think you are assuming that different packets from the same TCP connection can end up at different VMs, and that is definitely not the case.
So when your client creates a WebSocket connection, the following happens. An incoming request for a TCP connection is received by the load balancer. It decides, based on the distribution mode, which destination VM to send the request to. It records this information internally. Any subsequent incoming packets for that TCP connection are automatically sent to the same VM because it looks up the appropriate VM from that internal table. Hence, all the client messages on your WebSocket will end up at the same VM and port.
If you create a new WebSocket it could end up at a different VM but all the messages from the client will end up at that same different VM.
Hope that helps.

On your Azure Load Balancer you'll want to configure Session persistence. This will ensure that when a client request gets directed to Server A, then any subsequent requests from that client will go to Server A.
Session persistence specifies that traffic from a client should be handled by the same virtual machine in the backend pool for the duration of a session.
None - successive requests from the same client may be handled by any virtual machine.
Client IP - successive requests from the same client IP address will be handled by the same virtual machine.
Client IP and protocol - successive requests from the same client IP address and protocol combination will be handled by the same virtual machine.

SignalR only knows the URL you provided when starting the connection, and it uses it to send requests to the server. I believe Azure App Service uses sticky sessions by default. You can find some details about this here: https://azure.microsoft.com/en-us/blog/azure-load-balancer-new-distribution-mode/
When using multiple servers and scale-out, the client can send messages to any server.
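To illustrate, the client only ever targets the balancer's address; nothing in the API refers to the individual VMs behind it. A minimal sketch using the SignalR 2 .NET client (the URL and hub name are placeholders):

using System.Threading.Tasks;
using Microsoft.AspNet.SignalR.Client;

class Client
{
    static async Task RunAsync()
    {
        // The only address the client knows is the load balancer's public endpoint.
        var connection = new HubConnection("http://my-lb.cloudapp.azure.com/");
        var chat = connection.CreateHubProxy("ChatHub");

        await connection.Start();                      // negotiate/connect requests go through the balancer
        await chat.Invoke("Send", "client1", "hello"); // later calls reuse the established transport
    }
}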

Thank you for your answers guys.
Doing a bit of reading, it seems that the Azure load balancer uses 5-tuple distribution by default.
Here is the article: https://learn.microsoft.com/en-us/azure/load-balancer/load-balancer-distribution-mode
The problem with 5-tuple is that it is sticky per transport session.
And I think this is what causes the client requests using SignalR to hit the same VM in the scale set: the balancer interprets the SignalR connection as a single transport session.
Application Gateway wasn't an option from the beginning because it has many features which we do not need (so it doesn't make sense to pay for something we don't use).
But now it seems that Application Gateway is the only balancer in Azure capable of doing round-robin when balancing traffic.

Related

Azure Load Balancing Solutions. Direct Traffic to Specific VMs

We are having difficulties choosing a load balancing solution (Load Balancer, Application Gateway, Traffic Manager, Front Door) for IIS websites on Azure VMs. The simple use case, when there are 2 identical sites, is covered well – just use Azure Load Balancer or Application Gateway. However, when we would like to update the websites and test those updates, we encounter limitations of these load balancing solutions.
For example, if we would like to update the IIS websites on VM1 and test those updates, the strategy would be:
Point the load balancer to VM2 only.
Update the IIS website on VM1.
Test the changes.
If all tests pass, point the load balancer to VM1 only, while we update VM2.
Point the load balancer to both VMs.
We would like to know what the best solution is for directing traffic to only one VM. So far, we only see one option – removing a VM from the backend address pool, then returning it, and repeating the process for the other VMs. Surely there must be a better way to direct 100% of traffic to only one VM (or to specific VMs), right?
Update:
We ended up blocking the connection between the VMs and the Load Balancer by creating a Network Security Group rule with a Deny action on the Load Balancer service tag. Once we want that particular VM to be accessible again, we switch the NSG rule from Deny to Allow.
The downside of this approach is that it takes 1-3 minutes for the changes to take effect. See: Continuous Delivery with Azure Load Balancer
If anybody can think of a faster (or instantaneous) solution for this, please let me know.
Without any Azure specifics, the usual pattern is to point the load balancer at a /status endpoint of your process, and to design the endpoint behavior according to your needs, e.g.:
When a service is first deployed, its status is 'pending'.
When you deem it healthy, e.g. all tests pass, do a POST /status to update it.
The service then returns status 'ok'.
Meanwhile the load balancer polls the /status endpoint every minute and knows to mark down / exclude forwarding for any servers not in the 'ok' state.
Some load balancers / gateways may work best with HTTP status codes whereas others may be able to read response text from the status endpoint. Pretty much all of them will support this general behavior though - you should not need an expensive solution.
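For illustration, a toggleable status endpoint could look roughly like the sketch below (assuming ASP.NET Web API 2 with attribute routing enabled; the route, the in-memory flag and the wiring are illustrative, and a real implementation would persist the state and protect the POST):

using System.Net;
using System.Web.Http;

public class StatusController : ApiController
{
    // In-memory flag for illustration only.
    private static volatile bool _ok = false;

    // GET /status - polled by the load balancer.
    [HttpGet, Route("status")]
    public IHttpActionResult Get()
    {
        // 200 only once the instance has been marked healthy; otherwise 503
        // so the probe excludes this server from forwarding.
        return _ok ? (IHttpActionResult)Ok("ok")
                   : StatusCode(HttpStatusCode.ServiceUnavailable);
    }

    // POST /status - called by the deployment pipeline once all tests pass.
    [HttpPost, Route("status")]
    public IHttpActionResult Post()
    {
        _ok = true;
        return Ok("ok");
    }
}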
I had exactly the same requirement in an Azure environment which I built a few years ago. Azure Front Door didn't exist, and I had looked into using the Azure API to automate the process of adding and removing backend servers the way you described. It worked sometimes, but I found the Azure API was unreliable (lots of 503s reconfiguring the load balancer) and very slow to divert traffic to/from servers as I added or removed them from my cluster.
The solution that follows probably won't be well received if you are looking for an answer which purely relies upon Azure resources, but this is what I devised:
I configured an Azure load balancer with the simplest possible HTTP and HTTPS round-robin load balancing of requests on my external IP to two small Azure VMs running Debian with HAProxy. I then configured each HAProxy VM with backends for the actual IIS servers. I configured the two HAProxy VMs in an availability set such that Microsoft should not ever reboot them simultaneously for maintenance.
HAProxy is an excellent and very robust load balancer, and it supports nearly every imaginable load balancing scenario, and crucially for your question, it also supports listening on a socket to control the status of the backends. I configured the following in the global section of my haproxy.cfg:
global
log /dev/log local0
log /dev/log local1 notice
chroot /var/lib/haproxy
stats socket /run/haproxy/admin.sock mode 660 level admin
stats socket ipv4@192.168.95.100:9001 level admin
stats timeout 30s
user haproxy
group haproxy
daemon
In my example, 192.168.95.100 is the first HAProxy VM, and 192.168.95.101 is the second. On the second server, these lines would be identical except for its internal IP.
Let's say you have an HAProxy frontend and backend for your HTTPS traffic to two web servers, ws1pro and ws2pro, with the IPs 192.168.95.10 and 192.168.95.11 respectively. For simplicity's sake, I'll assume we don't need to worry about HTTP session state differences across the two servers (e.g. out-of-process session state), so we just divert HTTPS connections to one node or the other:
listen stats
bind *:8080
mode http
stats enable
stats refresh 10s
stats show-desc Load Balancer
stats show-legends
stats uri /
frontend www_https
bind *:443
mode tcp
option tcplog
default_backend backend_https
backend backend_https
mode tcp
balance roundrobin
server ws1pro 192.168.95.10:443 check inter 5s
server ws2pro 192.168.95.11:443 check inter 5s
With the configuration above, since both HAProxy VMs are listening for admin commands on port 9001, and the Azure load balancer is sending the client's requests to either VM, we need to tell both servers to disable the same backend.
I used Socat to send the cluster control commands. You could do this from a Linux VM, but there is also a Windows version of Socat, and I used the Windows version in a set of really simple batch files. The cluster control commands would actually be the same in BASH.
stop_ws1pro.bat:
echo disable server backend_https/ws1pro | socat - TCP4:192.168.95.100:9001
echo disable server backend_https/ws1pro | socat - TCP4:192.168.95.101:9001
start_ws1pro.bat:
echo enable server backend_https/ws1pro | socat - TCP4:192.168.95.100:9001
echo enable server backend_https/ws1pro | socat - TCP4:192.168.95.101:9001
These admin commands execute almost instantly. Since the HAProxy configuration above enables the stats page, you should be able to watch the status change happen on the stats page as soon as it refreshes. The stats page will show the connections or sessions draining from the server you disabled over to the remaining enabled servers when you disable a backend, and then show them returning to the server once it is enabled again.
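If you would rather drive this from .NET tooling than from batch files and socat, the same admin commands can simply be written to the TCP stats sockets; a minimal sketch (the IPs, port and backend/server names match the example above):

using System.IO;
using System.Net.Sockets;

class HaproxyAdmin
{
    // Writes one admin command to a single HAProxy stats socket over TCP.
    static void Send(string host, int port, string command)
    {
        using (var client = new TcpClient(host, port))
        using (var writer = new StreamWriter(client.GetStream()))
        {
            writer.Write(command + "\n");
            writer.Flush();
        }
    }

    static void Main()
    {
        // Disable ws1pro on both HAProxy VMs, mirroring stop_ws1pro.bat.
        Send("192.168.95.100", 9001, "disable server backend_https/ws1pro");
        Send("192.168.95.101", 9001, "disable server backend_https/ws1pro");
    }
}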

Why is the Azure load balancer still sending traffic to nodes after the health probe goes down?

I have 2 Azure VMs sitting behind a Standard Azure Load Balancer.
The load balancer has a health probe pinging each VM every 5 seconds with HTTP on /health.
The interval is set to 5, the port is set to 80 with path /health, and the "unhealthy threshold" is set to 2.
During deployment of an application, we set the /health endpoint to return 503 and then wait 35 seconds to allow the load balancer to mark the instance as down and stop sending it new traffic.
However, the load balancer does not seem to fully take the VM out of rotation. It still sends inbound traffic to the down instance, causing downtime for our customers.
I can see in the IIS logs that the /health endpoint is indeed returning 503 when it should.
Any ideas what's wrong? Can it be some sort of TCP keep-alive?
I got confirmation from Microsoft that this is working "as intended", which makes the Azure Load Balancer a bad fit for web applications. This is the answer from Microsoft:
I was able to discuss your observation with the internal team. They explained that the Load Balancer does not currently have a "Connection Draining" feature and would not terminate existing connections.
Connection Draining is available with the Application Gateway (Application Gateway Connection Draining).
I heard this is also being planned for the Load Balancer as part of the future roadmap. You could also add your voice to the request for this feature for the Load Balancer by filling in the feedback form.
Load Balancer is a pass-through service which does not terminate existing TCP connections; the flow is always between the client and the VM's guest OS and application. If a backend endpoint's health probe fails, established TCP connections to this backend endpoint continue, but the load balancer stops sending new flows to the respective unhealthy instance. This is by design, to give you the opportunity to gracefully shut down from the application and avoid any unexpected and sudden termination of ongoing application workflows.
You may also consider configuring TCP reset on idle (https://learn.microsoft.com/en-us/azure/load-balancer/load-balancer-tcp-reset) to reduce the number of idle connections.
I would suggest the following approach:
Place a healthcheck.html page on each of your VMs. As long as the probe can retrieve the page, the load balancer will keep sending user requests to the VM.
When you do the deployment, simply rename healthcheck.html to a different name such as _healthcheck.html. This will cause the probe to start receiving HTTP 404 errors and will take that machine out of the load-balanced rotation.
After your deployment has been completed, rename _healthcheck.html back to healthcheck.html. The Azure LB probe will start getting HTTP 200 responses and as a result start sending requests to this VM again.
Thanks,
Manu

Firebase connection with multiple Node.js behind AWS/ALB load balancer and Nginx

I need to connect Firebase to a Node setup on AWS/Elastic Beanstalk. There are 1-4 Node servers, behind an ALB load balancer and an Nginx proxy. Firebase uses WSS protocol (hence the need for ALB, because the regular ELB does not support sockets). When a Node instance authenticates with Firebase it gets a socket the app can listen to.
My question: Since there could be any of the Node servers communicating with Firebase, how can the sockets be made sticky, so that regardless of which of the Node servers opened a socket, it will be the right socket for each communication?
Thanks!
ZE
You can enable sticky sessions in the AWS ALB for the WSS protocol to work and keep sending traffic to the same EC2 instance over a period of time.
Also note that you need to configure stickiness at the Target Group level.
I created a 2nd Target Group called "xxxSocket" with Stickiness enabled, and left the 1st Target Group "xxxHTTP" without Stickiness (the default). Finally, in my Application Load Balancer, I added a new Rule with "Path Pattern" = /socket.io that routes to Target Group "xxxSocket", leaving the default pattern routed to "xxxHTTP".
Reference: AWS Forum thread "Anyone gotten the new Application Load Balancer to work with websockets?"
Also the WebSockets connections are inherently sticky.
WebSockets connections are inherently sticky. If the client requests a connection upgrade to WebSockets, the target that returns an HTTP 101 status code to accept the connection upgrade is the target used in the WebSockets connection. After the WebSockets upgrade is complete, cookie-based stickiness is not used.
Reference: Target Groups for Your Application Load Balancers

gRPC client not using grpc-lb even with load-balancer address

I am trying to load-balance RPC calls to multiple Node.js backends from a single Node.js client using an HAProxy load balancer with TCP-based load balancing. I am providing this load balancer's dns-name:port while creating the gRPC client, which, according to https://github.com/grpc/grpc/blob/master/doc/load-balancing.md, should be treated as a load-balancer address, and a subchannel should be opened with each of the LB's backend servers. But I can see that the only channel that opens is with the load balancer, and all the RPCs are sent to a single server only, until the TCP idle connection timeout hits and a new TCP connection is set up with a new server.
Just wanted to ask: how does gRPC detect whether the client is connecting to a load balancer or to a server? And is there any way to tell the client that the address it is connecting to is a load-balancer address, so that it should use the grpc-lb policy instead of pick-first?
I later understood that for the load-balancing policy to figure out whether the address it is given is a load balancer or not, it needs an additional boolean indicating whether it is an LB address. So I needed to set up SRV records for my load-balancer address to provide this additional information required by gRPC's load-balancing policy, according to https://github.com/grpc/proposal/blob/master/A5-grpclb-in-dns.md

Directly accessing Azure workers; bypassing the load balancer

Typically, access to Azure workers is done via endpoints that are defined in the service definition. These endpoints, which must be TCP or HTTP(S), are passed through a load balancer and then connected to the actual IP/port of the Azure machines.
My application would benefit dramatically from the use of UDP, as I'm connecting from cellular devices where bytes are counted for billing and the overhead of SYN/ACK/FIN dwarfs the 8 byte packets I'm sending. I've even considered putting my data directly into ICMP message headers. However, none of this is supported by the load balancer.
I know that you can enable ping on Azure virtual machines and then ping them -- http://weblogs.thinktecture.com/cweyer/2010/12/enabling-ping-aka-icmp-on-windows-azure-roles.html.
Is there anything preventing me from using a TCP-based service (exposed through the load balancer) that would simply hand out the IP address and port of an Azure VM, and then have the application communicate directly with that worker? (I'll have to handle load balancing myself, as shown in the sketch after the next paragraph.) If the worker gets shut down or moved, my application will be smart enough to reconnect to the TCP endpoint and ask for a new place to send data.
Does this concept work, or is there something in place to prevent this sort of direct access?
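To make the idea concrete, the "directory" endpoint I have in mind would be something as simple as the sketch below (the port and the worker address are placeholders; a real version would pick a worker and read its current IP/port from the role environment):

using System.Net;
using System.Net.Sockets;
using System.Text;

class EndpointDirectory
{
    static void Main()
    {
        // Listen on the internal port behind the TCP endpoint and hand each
        // caller the direct address of a worker to send UDP packets to.
        var listener = new TcpListener(IPAddress.Any, 10000);
        listener.Start();
        while (true)
        {
            using (var client = listener.AcceptTcpClient())
            {
                var reply = Encoding.ASCII.GetBytes("10.0.0.4:5000\n");
                client.GetStream().Write(reply, 0, reply.Length);
            }
        }
    }
}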
You'd have to run your own router which exposes an input (external) endpoint and then routes to an internal endpoint of your service, either on the same role or a different one (this is actually how Remote Desktop works). You can't directly connect to a specific instance by choice.
There's a 2-part blog series by Benjamin Guinebertière that describes IIS Application Request Routing to provide sticky sessions (part 1, part 2). This might be a good starting point.
Ryan Dunn also talked about http session routing on the Cloud Cover Show, along with a follow-up blog post.
I realize these two examples aren't exactly what you're doing, as they're routing http, but they share a similar premise.
There's a feature called InstanceInputEndpoint which you can use to define ports on the public IP which will be directed to a local port on a particular VM instance. So you will have a particular IP+port combination which can directly access a particular VM.
<InstanceInputEndpoint name="HttpInstanceEndpoint" protocol="tcp" localPort="80">
  <AllocatePublicPortFrom>
    <FixedPortRange max="8089" min="8081" />
  </AllocatePublicPortFrom>
</InstanceInputEndpoint>
More info:
http://msdn.microsoft.com/en-us/library/windowsazure/gg557552.aspx
