Clients disconnect when a new instance is added - Azure

When I increase the number of instances from 1 to 2 in my Scale Set, my connection through the Application Gateway is dropped and I have to refresh the page. This works fine on AWS and other platforms. Do you have any comments about it?

Related

Lambda lost connection to RDS at 01:00 2019-01-12 (EU/London)

I have a set of Lambda functions that process messages on an SQS queue. They take data sets, process them and store the results in an RDS MySQL database, which they connect to via the VPC. Both the Lambda functions and the RDS database are in the same availability zone.
This has been working for the last couple of months without any issues, but early this morning (2019-01-12) at 01:00 I started seeing Lambda timeouts and messages being moved into the dead-letter queue.
I've done some troubleshooting and confirmed that the reason for the timeouts is Lambda's inability to establish a connection to the database server.
The RDS server is public, but locked down to allow access only through the VPC and two public IPs.
I've taken the following steps so far to try and resolve the issue:
Given the Lambda service role admin rights, to rule out IAM issues.
Unassigned the VPC from the Lambda functions and opened up RDS inbound access from 0.0.0.0/0, to rule out VPC issues.
Restarted the RDS hosts, the good ol' off'n'on again.
Used serverless to invoke the Lambda functions locally with test data (worked). My local machine connects to the public RDS IP, not through the VPC.
Changed the runtime environment from Python 3.6 to 3.7.
It doesn't appear to be a code issue, as it's been working flawlessly for the past couple of months, I can invoke it locally without issue, and my Elastic Beanstalk instance, which sits on the same VPC subnet, continues to connect through the VPC without issue.
Here's the code I'm using to connect:
import os
from sqlalchemy import create_engine, MetaData
from sqlalchemy.pool import NullPool

connectionString = 'mysql+pymysql://{0}:{1}@{2}/{3}'.format(os.environ['DB_USER'], os.environ['DB_PASSWORD'], os.environ['DB_HOST'], os.environ['DB_SCHEMA'])
engine = create_engine(connectionString, poolclass=NullPool)
with engine.connect() as con:  # <--- breaking here
    meta = MetaData(engine, reflect=True)  # <-- never gets to here
I double-checked the connection string and user accounts; both are correct and work locally.
If someone could point me in the right direction, I'd be grateful!
My first guess is that you've hit a connection limit on the RDS database. Because Lambdas can execute concurrently (which could easily happen if there were suddenly a lot of messages in your SQS queue), and each execution opens a new connection to your DB, the database's connection limit can get saturated.
If this is the case, you can set a concurrent execution limit on your Lambda function to prevent this.
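For reference, the limit can also be set programmatically; a minimal sketch with boto3 (the function name and the limit value are placeholders):

import boto3

# Cap concurrent executions so a burst of SQS messages can't exhaust the
# database's connection limit (function name and value are hypothetical).
client = boto3.client("lambda")
client.put_function_concurrency(
    FunctionName="sqs-processor",
    ReservedConcurrentExecutions=10,
)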
A side note - it is not recommended to use a database with a persistent connection in a serverless architecture exactly for this reason. AFAIK, AWS is working on a better solution to use RDS from Lambda, but it's not available yet.
So...
I was changing security groups and it was having no effect on the RDS host; at one point I removed all access and could still connect, which is crazy. At that point I started to think that Friday night's outage had put the underlying RDS host into a weird state. I put the security groups back the way they should be, stopped and started the RDS host (a restart had no effect), and everything started to work again.
Very frustrating, but happy it's finally resolved.

Azure Http connection gets interrupted after 5 minutes

We have a setup with several RESTful APIs on the same VM in Azure.
The websites run on Kestrel behind IIS.
They are protected by the Azure Application Gateway with the firewall enabled.
We now have requests that run for at least 20 minutes.
The requests run their full length uninterrupted on Kestrel (visible in the logs), but the sender either gets "socket hang up" after exactly 5 minutes or waits forever, even after the request has finished in Kestrel. The request continues in Kestrel even if the connection to the sender was interrupted.
What I have done:
Wrote a small example application that returns after a set number of seconds, to rule out our websites being the problem.
Ran the request inside the VM (to localhost): no problems, the response was received.
Ran the request within Azure from one VM to another: the request ran forever.
Ran the request from outside of Azure: the request terminates after 5 minutes with "socket hang up".
Checked the configured timeouts: Kestrel: 50m, IIS: 4000s, Application Gateway HTTP settings: 3600.
Requests were tested with Postman.
Is there another request or connection timeout hidden somewhere in Azure?
We now have requests that would run for at least 20 minutes.
This is a horrible architecture and it should be rewritten to be async. Don't take this personally, it is what it is. Consider returning a 202 Accepted with a Location header to poll for the result.
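A rough sketch of that pattern (shown in Python/Flask purely for illustration; the endpoints and the in-memory job store are made up):

import threading
import uuid

from flask import Flask, jsonify

app = Flask(__name__)
jobs = {}  # job_id -> {"status": ..., "result": ...}; in-memory for the sketch

def long_running_work(job_id):
    # ... the 20-minute computation would go here ...
    jobs[job_id] = {"status": "done", "result": {"answer": 42}}

@app.route("/reports", methods=["POST"])
def start_report():
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "running", "result": None}
    threading.Thread(target=long_running_work, args=(job_id,)).start()
    # Return immediately and tell the caller where to poll for the result.
    return "", 202, {"Location": f"/reports/{job_id}"}

@app.route("/reports/<job_id>")
def get_report(job_id):
    job = jobs.get(job_id)
    if job is None:
        return "", 404
    if job["status"] == "running":
        return "", 202  # still working; poll again later
    return jsonify(job["result"])

The caller polls the Location URL, so no single HTTP connection has to stay open for 20 minutes and intermediate idle timeouts stop mattering.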
You're most probably hitting the Azure SNAT layer idle timeout.
You can change it under the Configuration blade of the public IP.
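If you prefer scripting over the portal blade, something along these lines with the azure-mgmt-network SDK should do the same thing (the resource names and subscription ID are placeholders):

from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

# Raise the idle timeout on the public IP used by the gateway
# (default is 4 minutes, maximum is 30). Names below are hypothetical.
client = NetworkManagementClient(DefaultAzureCredential(), "<subscription-id>")
pip = client.public_ip_addresses.get("my-resource-group", "my-public-ip")
pip.idle_timeout_in_minutes = 30
client.public_ip_addresses.begin_create_or_update(
    "my-resource-group", "my-public-ip", pip
).result()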
So I ran into something like this a little while back:
For us the issue was probably the timeout, as the other answer suggests, but the solution (instead of increasing the timeout) was to add PgBouncer in front of our Postgres database to manage the connections and make sure a new one is started before the timeout fires.
Not sure what your backend connection looks like, but something similar (a database proxy in front of the backend) could work to give you more ability to tune connection / reconnection on your side.
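As a very rough illustration of the proxy idea (PgBouncer and Postgres assumed here; all connection details are placeholders): the application connects to the pooler instead of the database, and the pooler owns and recycles the server-side connections.

import psycopg2

# Connect to PgBouncer (default port 6432) rather than Postgres directly (5432);
# PgBouncer keeps a small pool of warm server connections behind it.
conn = psycopg2.connect(
    host="127.0.0.1",
    port=6432,
    dbname="app",
    user="app",
    password="secret",
)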
For us, we were running AKS (Azure Kubernetes Service), but all Azure public IPs obey the same rules that cause issues similar to this one.
While it isn't a definitive answer, I know there are also two SKUs of public IP address; the 'Basic' one doesn't have the same configurability. Could it be something related to the difference between Basic and Standard public IPs / load balancers?

AWS application load balancer and socket.io

I have a socket.io chat room running whose traffic is outgrowing the single machine we are running it on. We have run benchmarks using the ws library for raw WebSockets, and it does perform much better, which would better utilize our hardware. This would come at the cost of having to rewrite our application, though.
Our socket.io application allows users to create private chat rooms, which are implemented using namespaces, e.g.:
localhost:8080/room/1
localhost:8080/room/2
localhost:8080/room/3
When everything is in one instance it is quite easy, but now we are looking to expand this capacity into multiple nodes.
We run this instance in Amazon's cloud. Previously it looked like scaling WebSockets was an issue with ELBs. We have noticed that Amazon now offers an Application Load Balancer which supports WebSockets. This sounds great, but after reading the documentation I must admit I don't really know what it means. If I am using socket.io with thousands of namespaces, do I just put instances behind this ALB and everything will work? My main question is:
If x users join a namespace, will the ALB automatically route my messages to and from the proper users? Let's say I have 5 vanilla socket.io instances running behind the ALB. User 1 creates a namespace. A few hours pass and user 99999 comes along and wants to join this namespace; will any additional code need to be written for this, or will the ALB route everything where it should go? The same goes for sending and receiving messages.
While the ALB will load balance the users correctly, you will need to adapt your code a little, since users that joined a specific room will be dispersed across different servers.
In their documentation socket.io provides a way to do this:
Now that you have multiple Socket.IO nodes accepting connections, if you want to broadcast events to everyone (or even everyone in a certain room) you'll need some way of passing messages between processes or computers.
The interface in charge of routing messages is what we call the Adapter. You can implement your own on top of the socket.io-adapter (by inheriting from it) or you can use the one we provide on top of Redis: socket.io-redis:
var io = require('socket.io')(3000);
var redis = require('socket.io-redis');
io.adapter(redis({ host: 'localhost', port: 6379 }));
ALB setup
I would recommend enabling sticky sessions in your ALB; otherwise the socket.io handshake will fail when using a non-WebSocket transport, such as long polling, since the handshake with these transports requires more than one request, and all of those requests need to be served by the same server.
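For reference, stickiness can also be turned on programmatically; a hedged sketch with boto3 (the target group ARN is a placeholder):

import boto3

elbv2 = boto3.client("elbv2")
# Enable load-balancer-generated cookie stickiness on the target group,
# so all polling requests of one handshake land on the same instance.
elbv2.modify_target_group_attributes(
    TargetGroupArn="arn:aws:elasticloadbalancing:...:targetgroup/socketio/123",
    Attributes=[
        {"Key": "stickiness.enabled", "Value": "true"},
        {"Key": "stickiness.type", "Value": "lb_cookie"},
        {"Key": "stickiness.lb_cookie.duration_seconds", "Value": "86400"},
    ],
)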
Alternative using ALB Routing without socket.io adapter.
What if I wanted to avoid having a Redis database? For example, if my rooms are created by users and userA creates a room on instance 4, how would another user who wants to join that room know which instance it is on? Would I need the adapter here too?
The goal of this alternative is to have each room assigned to a specific EC2 Instance. We're going to achieve this using ALB Routing
N rooms > 1 instance.
Step 1:
You will need to change your room URLs to something like:
/i1/room/550
/i1/room/20
/i2/room/5
/i5/room/492
being:
/{instance-number}/room/{room-id}
This is needed so ALB can route each room to a specific instance.
Step 2:
Create N target groups (N being the number of instances you have at the moment)
Step 3:
Register each instance to each target group
Target Groups > Instance X target group > Target tab > Edit > Choose instance X > add to registered
Target group X > EC2 Instance X
Target group Y > EC2 Instance Y
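The same registration can be scripted instead of done in the console; a minimal sketch with boto3 (the ARN and instance ID are placeholders):

import boto3

elbv2 = boto3.client("elbv2")
# One target group per instance: register exactly one EC2 instance in each.
elbv2.register_targets(
    TargetGroupArn="arn:aws:elasticloadbalancing:...:targetgroup/instance-1/abc",
    Targets=[{"Id": "i-0123456789abcdef0"}],
)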
Step 4:
Edit ALB target rules
Load Balancers > Your ALB > Listeners > View/Edit Rules
Step 5:
Create one rule per target group/instance with following settings:
IF > Path: /iX/room/*
THEN > forward to: instanceX
Once you have this setup when you enter:
/i1/room/550 you will be using EC2 Instance 1.
/i2/room/200 will be using EC2 Instance 2
and so on.
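If you have many instances, the path rules from Step 5 can also be created with boto3; a sketch (the listener and target group ARNs are placeholders):

import boto3

elbv2 = boto3.client("elbv2")
# IF path matches /i1/room/* THEN forward to instance 1's target group.
elbv2.create_rule(
    ListenerArn="arn:aws:elasticloadbalancing:...:listener/app/my-alb/abc/def",
    Priority=1,
    Conditions=[{"Field": "path-pattern", "Values": ["/i1/room/*"]}],
    Actions=[{"Type": "forward",
              "TargetGroupArn": "arn:aws:elasticloadbalancing:...:targetgroup/instance-1/abc"}],
)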
Now you will have to implement your own logic to keep the rooms balanced across your instances. You don't want one instance hosting almost all the rooms.
I recommend the first approach since it can be autoscaled easily.

Azure WebSites / App Service Unexplained 502 errors

We have a stateless WebApp (with a shared Azure Redis Cache) that we would like to scale automatically via the Azure auto-scale service. When I activate auto-scale-out, or even when I activate 3 fixed instances for the WebApp, I get the opposite effect: response times increase exponentially or I get HTTP 502 errors.
This happens whether I use our configured Traffic Manager URL (which worked fine for months with single instances) or the native URL (.azurewebsites.net). Could this have something to do with Traffic Manager? If so, where can I find info on this combination (I have searched)? And how do I properly leverage auto-scale with Traffic Manager failover/performance? I have tried putting Traffic Manager in both failover and performance mode with no evident effect. I can gladly provide links via private channels.
UPDATE: We have reproduced the situation the "other way around": on the account where we were getting the frequent 5XX errors, we removed all load-balanced servers (only one server per app now) and the problem disappeared. And on the other account, we started to balance across 3 servers (no Traffic Manager configured) and soon got the frequent 502 and 503 show-stoppers.
Related hypothesis here: https://ask.auth0.com/t/health-checks-response-with-500-http-status/446/8
Possibly the cause? Any takers?
UPDATE
After reverting all WebApps to single instances to rule out any relationship to load balancing, things ran fine for a while. Then the same "502" behavior reappeared across all servers for a period of approx. 15 min on 04.Jan.16, then disappeared again.
UPDATE
Problem reoccurred for a period of 10 min at 12.55 UTC/GMT on 08.Jan.16 and then disappeared again after a few min. Checking logfiles now for more info.
UPDATE
Problem reoccurred for a period of 90 min at roughly 11.00 UTC/GMT on 19.Jan.16 also on .scm. page. This is the "reference-client" Web App on the account with a Web App named "dummy1015". "502 - Web server received an invalid response while acting as a gateway or proxy server."
I don't think Traffic Manager is the issue here. Since Traffic Manager works at the DNS level, it cannot be the source of the 5XX errors you are seeing. To confirm, I suggest the following:
Check if the increased response times are coming from the DNS lookup or from the web request.
Introduce Traffic Manager whilst keeping your single-instance / non-load-balanced setup, and confirm that the problem does not reappear
This will help confirm if the issue relates to Traffic Manager or some other aspect of the load-balancing.
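As a rough way to check the first point, you can time the DNS lookup separately from the full request; a sketch (the hostname is a placeholder, and the HTTP call resolves the name again, though the OS cache usually makes that negligible):

import socket
import time

import requests

host = "myapp.trafficmanager.net"  # placeholder

t0 = time.perf_counter()
socket.getaddrinfo(host, 443)                   # DNS resolution only
t1 = time.perf_counter()
requests.get(f"https://{host}/", timeout=60)    # full HTTP round trip
t2 = time.perf_counter()

print(f"DNS lookup: {t1 - t0:.3f}s, HTTP request: {t2 - t1:.3f}s")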
Regards,
Jonathan Tuliani
Program Manager
Azure Networking - DNS and Traffic Manager

Azure Auto Scaling works but load doesn't get distributed

I have a problem with auto scaling in Azure. The scaling process works fine, but when a new instance is added it receives no traffic.
My scenario:
I have 2 running instances with a WCF web service on them. Now I send data to the web service from 2 other servers (not Azure).
After a while the auto scaling kicks in and a new instance is added. The 2 servers are still producing load on the first 2 Azure servers, but the new one doesn't get any.
I thought Azure uses round robin for load balancing, or am I missing something else?
Thanks for any help.
The problem is caused by TCP connection keep-alive: when the clients first connect, connections are established to the existing instances and then persist to those instances. So when the service scales out, existing clients won't reconnect unless their connections are broken. New clients will connect to both existing and new instances.
Here's another question for a very similar scenario. For testing purposes you can just disable keep-alive to ensure that load is indeed distributed between instances.
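For such a test, any client that sends Connection: close will open a fresh TCP connection per request; a tiny sketch (the endpoint is a placeholder, and Python is used here only for illustration even though the real clients are WCF):

import requests

# Each request tears down its connection, so successive requests can be
# balanced onto different instances, including freshly scaled-out ones.
for i in range(20):
    r = requests.get(
        "http://my-service.cloudapp.net/api/ping",  # hypothetical endpoint
        headers={"Connection": "close"},
    )
    print(i, r.status_code)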
