I have microservices written in Node/Express, hosted on EC2 behind an Application Load Balancer.
Some users are getting a 502 even before the request reaches the server.
I log every request on each instance, and the logs for the failing requests are missing: I can see the request immediately before the 502 and the requests right after it, which is why I'm assuming those requests never reach the servers. Most users get past it by refreshing the page or opening an anonymous tab, which routes the connection to a different machine (we have 6).
I can tell from the load balancer logs that the load balancer responds to the request almost immediately with a 502. My guess is that this could be a TCP RST.
I had a similar problem a long time ago, and I had to add keepAliveTimeout and headersTimeout to the Node configuration. Here are my settings (the ALB still uses its default idle timeout of 60 s):
server.keepAliveTimeout = 65000; // keep sockets open longer than the ALB's 60 s idle timeout
server.headersTimeout = 80000;   // must be greater than keepAliveTimeout
The metrics, especially memory and CPU usage, are fine on all instances.
These 502 errors started after an update in which we introduced several packages, for instance axios. At first I thought it could be axios, because keep-alive is not enabled by default there. But enabling it didn't help. Other than axios, we just use the request library.
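For reference, this is roughly how I enabled keep-alive in axios when testing that theory (the client name is just illustrative):
const http = require('http');
const https = require('https');
const axios = require('axios');

// One axios instance with keep-alive agents, reused for all outgoing calls
const client = axios.create({
  httpAgent: new http.Agent({ keepAlive: true }),
  httpsAgent: new https.Agent({ keepAlive: true }),
});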
Any tips on how I should debug/fix this issue?
HTTP 502 errors are usually caused by a problem between the load balancer and the target, which would explain why the requests never reach your server: presumably the load balancer can't reach the instance for some reason or other.
This link has some hints on how to get logs from a Classic Load Balancer. However, since you didn't specify, you might be using an Application Load Balancer, in which case this link might be more useful.
From the ALB access logs I knew that either the ALB couldn't connect to the target, or the connection was being immediately terminated by the target.
The most difficult part was figuring out how to replicate the 502 error.
It looks like the Node version I was using has a request header size limit of 8 KB. If any request exceeded that limit, the target would reject the connection, and the ALB would return a 502 error.
Solution:
I solved the issue by adding --max-http-header-size=size to the node start command line, where size is a value greater than the 8 KB default.
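For example, to raise the limit to 16 KB (the file name and exact value here are placeholders):
node --max-http-header-size=16384 server.js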
A few common reasons for an AWS Load Balancer 502 Bad Gateway:
Make sure the public subnets your ALB targets are set to auto-assign a public IP (so that instances deployed there are automatically assigned one).
Make sure the security group for your ALB allows HTTP and/or HTTPS traffic from the IPs you are connecting from.
I had the same problem for a month or two and couldn't find the solution. I even had AWS Premium Support, but they were not able to find it either. I was getting 502 errors randomly, maybe 10 times per day. Finally, after reading the docs from AWS:
The target receives the request and starts to process it, but closes the connection to the load balancer too early. This usually occurs when the duration of the keep-alive timeout for the target is shorter than the idle timeout value of the load balancer.
https://aws.amazon.com/premiumsupport/knowledge-center/elb-alb-troubleshoot-502-errors/
SOLUTION:
I was running the Apache web server on EC2, so I increased KeepAliveTimeout to 65. This did the trick for me.
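For reference, the relevant httpd.conf directives look something like this (65 is simply any value above the load balancer's 60-second idle timeout):
# Keep idle connections open slightly longer than the load balancer does
KeepAlive On
KeepAliveTimeout 65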
Related
I made a server with NestJS and ran it on an EC2 instance, and it is responding with 502 randomly.
I googled, and a lot of articles mentioned the difference between the timeout intervals of the ALB and the server.
When I made a simple web server with Express, the same issue happened, but I fixed it by adjusting keepAliveTimeout and headersTimeout (65000 ms and 67000 ms respectively).
However, this NestJS server still responds with 502 and I don't know what else to do.
I tried changing the ALB idle timeout to 30 s and to 120 s, and neither worked.
What else could be causing this issue? I am lost.
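For reference, applying the same timeouts in NestJS goes through the underlying Node http.Server; here is a minimal sketch of the bootstrap file, assuming the default Express adapter and a placeholder port:
import { NestFactory } from '@nestjs/core';
import { AppModule } from './app.module';

async function bootstrap() {
  const app = await NestFactory.create(AppModule);
  // Grab the underlying Node http.Server and set the timeouts on it
  const server = app.getHttpServer();
  server.keepAliveTimeout = 65000; // above the ALB's 60 s idle timeout
  server.headersTimeout = 67000;   // must exceed keepAliveTimeout
  await app.listen(3000);
}
bootstrap();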
I have a DO Load Balancer with 4 servers behind it. I've been using socket.io with Sticky Sessions enabled in the Load Balancer settings, and it had been working just fine for a while.
Recently, clients have not been able to connect at all, getting a 400 error immediately on connection. I haven't changed anything in the way I connect to the sockets. If I require the transport to be 'websocket' only from the client, it does connect successfully, but then I lose the polling fallback (one of the main benefits of socket.io).
Also, connecting directly to one of the droplets works as expected, so the issue definitely lies with the Load Balancer.
Does anyone have any idea what kind of setup needs to be in place for this to work with the DO Load Balancers? Anything that might have changed recently?
I'm running socket.io on a NodeJS server with Express if that helps at all.
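For reference, the websocket-only workaround on the client side looks roughly like this (a sketch; the URL is a placeholder):
const io = require('socket.io-client');

// Skip the initial HTTP long-polling handshake, which is what the
// load balancer appears to be rejecting, and connect over websocket only
const socket = io('https://lb.example.com', {
  transports: ['websocket'],
});

socket.on('connect', () => console.log('connected'));
socket.on('connect_error', (err) => console.error('connect failed:', err));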
Edit #1: Added a screenshot of the LB Settings
I'm running a Node.js web server on Azure, using the Express library for HTTP handling. We've been attempting to enable CloudFlare protection on the domains pointing to this box, but when we turn CloudFlare proxying on, we see cycling periods of requests succeeding and requests failing with a 524 error. I understand this error is returned when the server fails to respond to the connection with an HTTP response in time, but I'm having a hard time figuring out why it is
A. Only failing sometimes as opposed to all the time
B. Immediately fixed when we turn cloudflare proxying off.
I've been attempting to confirm the TCP connection using
tcpdump -i eth0 port 443 | grep cloudflare
(the requests come over HTTPS) and have seen curl requests fail seemingly without any traffic hitting the box, while others do arrive. For further reference, these requests should be, and are, quite quick when they succeed, so I have a hard time believing the issue is a long-running process stalling the response.
I do not believe we have any sort of IP-based throttling or firewall (at least not intentionally).
Any ideas greatly appreciated, thanks
It seems that the issue was caused by DNS resolution.
On Azure, you can configure a custom domain name for your web app, and to use CloudFlare you need to switch DNS resolution to the CloudFlare DNS servers. See https://azure.microsoft.com/en-us/documentation/articles/web-sites-custom-domain-name/ for more information on configuring a domain name.
You can also refer to the CloudFlare FAQ "How do I enter Windows Azure DNS records in CloudFlare?" to make sure the DNS settings are correct.
Try clearing your cookies.
I had a similar issue when I changed CloudFlare settings to a new host, but CloudFlare cookies for the domain were doing something funky to the request (I am guessing it might have been trying to contact the old host?).
I am having trouble using HTTP POST when CloudFlare is enabled.
It keeps returning a 524 timeout.
Failed to load resource: the server responded with a status of 524 (Origin Time-out)
But when I disable CloudFlare, the HTTP POST works fine.
Any idea what might have caused this?
UPDATE
I am using AJAX POST; does this have anything to do with AJAX?
Thanks.
There are some general causes for a CloudFlare 524 error, and Support should be able to provide more detailed troubleshooting.
The console utility netstat shows that some connections from CloudFlare are in the CLOSE_WAIT state, indicating that the server just sits there without correctly closing connections. Looking at the TCP traffic of my web server with Message Analyzer, I found several connections that were established and had sent an HTTP request, but the request was never processed by my server.
So we have an answer: the number of simultaneously established connections outnumbered the available accept() calls. The TCP stack completes the connection and waits for the application to handle it. Depending on the situation, that may never happen, so the client side simply drops the connection after a 30-second timeout without ever getting a response.
To fix this, you must increase the number of outstanding possible accepts. The parameter may be called "max simultaneous connections" or something similar; check your web server's documentation or ask its support to find out.
Also, as an experiment, you can force your server to reply with a Connection: close header to each request. This may prevent hitting the active-connections limit, because CloudFlare otherwise keeps connections alive for a very long time.
Also, the more simultaneous requests you make, the more likely you are to run into trouble. You can try setting a small server-side timeout for idle connections.
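If your origin happens to be Node/Express, the two experiments above might look something like this (a sketch; the port and backlog values are illustrative):
const express = require('express');
const app = express();

// Experiment: answer every request with Connection: close so CloudFlare
// opens a fresh connection each time instead of holding one open
app.use((req, res, next) => {
  res.set('Connection', 'close');
  next();
});

app.get('/', (req, res) => res.send('ok'));

// The third argument to listen() is the TCP accept backlog (Node's default
// is 511): the queue of completed connections waiting to be accepted
app.listen(3000, '0.0.0.0', 2048, () => {
  console.log('listening on 3000');
});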
P.S.: An illustration of the number of connections CloudFlare opened after one client loaded a page: http://i.imgur.com/IgwGLCf.png
Running ColdFusion on IIS, every request that runs for more than 60 seconds comes back to the browser as a blank page.
I've tried changing every setting that might affect this and it's still happening. I'm out of ideas other than posting here; I'm not sure if it's IIS or ColdFusion timing out.
I worked it out: it's not IIS or ColdFusion, it's the AWS load balancer. If I bypass it, everything works fine.
In our case, too, it was the load balancer causing the issue, not IIS. This also resulted in ASP.NET scripts being run twice when the load balancer timed out: it was trying once more to get a result after the first attempt timed out. Accessing the scripts via a whitelisted "direct" server IP address, avoiding the load balancer, fixed the problem.
Also, FYI, the timeouts we set manually were effective again, and Response.Flush() started working again. It could well be that beyond the load balancer some caching servers were involved, adding to the problem.
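In case it helps anyone else: rather than bypassing the load balancer, the usual fix is to raise its idle timeout above your longest-running request. With an ALB and the AWS CLI that looks something like this (the ARN is a placeholder):
aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn <your-alb-arn> \
  --attributes Key=idle_timeout.timeout_seconds,Value=120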