NestJS on EC2 keeps responding with 502 randomly - node.js

I built a server with NestJS and run it on an EC2 instance, and it randomly responds with 502.
I googled around, and a lot of articles mentioned a mismatch between the timeout intervals of the ALB and the server.
When I made a simple web page server with Express, the same issue happened, but I fixed it by adjusting keepAliveTimeout and headersTimeout
(the values were 65000 ms and 67000 ms respectively).
However, this NestJS server still responds with 502, and I don't know what else to do.
I tried changing the ALB timeout to 30 seconds and to 120 seconds, and that didn't work either.
What else could be causing this issue? I am lost.
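In case it helps to see what that looks like in NestJS: here is a minimal sketch of applying the same two timeouts there, on the assumption that Nest is wrapping a plain Node http.Server via the default Express adapter and a standard scaffolded AppModule (the port and the 65000/67000 ms values are just the ones from above):

// main.ts - minimal sketch, assuming the default Express adapter
import { NestFactory } from '@nestjs/core';
import { AppModule } from './app.module';

async function bootstrap() {
  const app = await NestFactory.create(AppModule);
  await app.listen(3000);

  // Nest wraps a plain Node http.Server; raise its timeouts above the ALB
  // idle timeout (60 s by default) so the target never closes an idle
  // keep-alive connection before the load balancer does.
  const server = app.getHttpServer();
  server.keepAliveTimeout = 65000; // > ALB idle timeout
  server.headersTimeout = 67000;   // must be > keepAliveTimeout
}
bootstrap();

If the 502s persist with this in place, the keep-alive mismatch is probably not the remaining cause.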

Related

504 error on nodejs sails API server randomly across all endpoints

We have Sails.js as an API server, and intermittently we are facing 504 "upstream timed out" errors. The call gets as far as Nginx, which then throws the upstream timeout, and no log appears on the application server. It happens across all the APIs intermittently, not all the time, the traffic on the server is very low (5-10 requests per minute), and there is no heavy computation going on, so I am not sure how to debug this; it is a very random issue. There are also no server restarts or other errors in the application logs, and the app runs fine under PM2.
We are using an AWS EC2 instance. The current timeout is 60 seconds, but none of our APIs takes more than 500 milliseconds. We are on Node 6.6 because this is a legacy monolith app and cannot be upgraded due to multiple dependencies and no single owner. Requests pass through the load balancer to Nginx, but sometimes they never reach the application server. The instance is also quite big in terms of CPU and memory, while the traffic is extremely low. The behavior is very random and not specific to any API: sometimes 1 out of 10 requests hits a gateway timeout, sometimes 2 out of 5.
Some of the logs from Nginx are below:
[error] 31688#31688: *38998779 connect() failed (110: Connection timed out) while connecting to upstream, client 10.X.X.X
2022/04/22 16:36:37 [error] 31690#31690: *38998991 connect() failed (110: Connection timed out) while connecting to upstream, client: <server_ip>, server: <DNS_URL>
I have tried almost everything from Stack Overflow, but nothing is helping, so please help me find the root cause so I can mitigate the issue.
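Since Nginx is reporting connect() timeouts to the upstream, one way to narrow this down is to probe the app directly on its local port from the instance itself, taking the load balancer and Nginx out of the picture. A minimal diagnostic sketch; the host, port, and path are assumptions to adjust for the real setup:

// probe.ts - diagnostic sketch only; run it on the EC2 instance itself
import * as http from 'http';

const HOST = '127.0.0.1'; // assumption: the app listens locally
const PORT = 1337;        // assumption: default Sails port
const PATH = '/';         // assumption: any cheap endpoint

setInterval(() => {
  const started = Date.now();
  const req = http.get({ host: HOST, port: PORT, path: PATH, timeout: 5000 }, (res) => {
    res.resume(); // drain the body
    console.log(`${new Date().toISOString()} status=${res.statusCode} ms=${Date.now() - started}`);
  });
  req.on('timeout', () => {
    console.error(`${new Date().toISOString()} no response within 5 s`);
    req.destroy();
  });
  req.on('error', (err) => console.error(`${new Date().toISOString()} error: ${err.message}`));
}, 2000);

If this probe also stalls intermittently, the app or the instance is at fault; if it never stalls, the problem sits between the load balancer, Nginx, and its upstream connection handling.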

Socket.io connecting/disconnecting unexpectedly in production - MEAN stack hosted on Elastic Beanstalk

I have a MEAN stack application hosted on AWS Elastic Beanstalk that uses socket.io
This Socket.io connection keeps connecting and disconnecting unexpectedly.
The socket causes 4xx responses on the server. After a few 4xx responses are sent to AWS, the environment automatically degrades and the socket disconnects, so it becomes a loop: AWS first receives the 4xx responses, then the environment becomes unhealthy, then the socket behaves even more strangely because the server is about to go down, and so on.
What matters is that the starting point is the 4xx caused by the socket.
The log I have says:
"GET /socket.io/?EIO=3&transport=polling&t=Nx_u0FB HTTP/1.1" 400 62
I tried to add the CORS option to the socket with the app domain as the origin, but it didn't help.
Please note that this happens in production only and not on localhost.
Also, please note that this case doesn't happen unless we have like 10-20 users/sockets connecting from different parts of the world.
If we have a few sockets, it rarely happens, and sometimes even if we have many users connecting from the same country, it doesn't happen. The behavior is very random.
Can anyone help with this?
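For reference, the CORS option mentioned above, with the app domain as the origin, looks roughly like this on the server side. This is only a sketch: it assumes Socket.io v3+, whereas the EIO=3 in the log points at v2 (where the equivalent was the origins server option), and the domain is a placeholder:

import { createServer } from 'http';
import { Server } from 'socket.io';

const httpServer = createServer();
const io = new Server(httpServer, {
  cors: {
    origin: 'https://app.example.com', // placeholder for the real app domain
    credentials: true,
  },
  // long polling first, then upgrade; matches the transport seen in the log
  transports: ['polling', 'websocket'],
});

io.on('connection', (socket) => {
  socket.on('disconnect', (reason) => console.log('socket disconnected:', reason));
});

httpServer.listen(3000);

As the question notes, this alone did not stop the 400s.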

AWS Load Balancer 502 Bad Gateway

I have microservices written in node/express hosted on EC2 with an application load balancer.
Some users are getting a 502 even before the request reaches the server.
I register every log inside each instance, and I don't have the logs of those requests, I have the request immediately before the 502, and the requests right after the 502, that's why I am assuming that the request never reaches the servers. Most users solve this by refreshing the page or using an anonymous tab, which makes the connection to a different machine (we have 6).
I can tell from the load balancer logs that the load balancer responds almost immediately to the request with 502. I guess that this could be a TCP RST.
I had a similar problem a long time ago, and I had to add keepAliveTimeout and headersTimeout to the Node configuration. Here are my settings (still using the LB default of 60 s):
server.keepAliveTimeout = 65000; // longer than the ALB idle timeout of 60 s
server.headersTimeout = 80000;   // longer than keepAliveTimeout
The metrics, especially memory and CPU usage of all instances are fine.
These 502 errors started after an update in which we introduced several packages, for instance axios. At first I thought it could be axios, because keep-alive is not enabled by default there, but that didn't work. Other than axios, we just use request.
Any tips on how should I debug/fix this issue?
HTTP 502 errors are usually caused by a problem with the load balancer, which would explain why the requests never reach your server: presumably the load balancer can't reach the server for some reason or other.
This link has some hints regarding how to get logs from a classic load balancer. However, since you didn't specify, you might be using an application load balancer, in which case this link might be more useful.
From the ALB access logs I knew that either the ALB couldn't connect to the target or the connection was being terminated immediately by the target.
The most difficult part was figuring out how to replicate the 502 error.
It looks like the node version I was using has a request header size limit of 8kb. If any request exceeded that limit, the target would reject the connection, and the ALB would return a 502 error.
Solution:
I solved the issue by adding --max-http-header-size=size to the node start command line, where size is a value greater than 8kb.
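For anyone who needs to reproduce this, here is a minimal sketch of a client that sends a single oversized request header; the ALB host name is a placeholder, and it assumes a plain HTTP listener (use the https module for HTTPS). Against a target still running with the 8 KB default, the connection is rejected and the ALB answers 502, as described above:

// repro-502.ts - sketch only; replace the host with your ALB DNS name
import * as http from 'http';

const req = http.request({
  host: 'my-alb-123.us-east-1.elb.amazonaws.com', // placeholder
  path: '/',
  headers: { 'x-padding': 'a'.repeat(9 * 1024) }, // ~9 KB header, over the 8 KB default
});
req.on('response', (res) => console.log('status:', res.statusCode));
req.on('error', (err) => console.error('request failed:', err.message));
req.end();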
A few common reasons for an AWS Load Balancer 502 Bad Gateway:
Make sure the public subnets that your ALB is targeting are set to auto-assign a public IP (so that instances deployed there are automatically assigned one).
Make sure the security group for your ALB allows HTTP and/or HTTPS traffic from the IPs you are connecting from.
I also had the same problem for one or two months and could not find the solution. I even have AWS Premium Support, but they were not able to find it either. I was getting the 502 error randomly, maybe 10 times per day. Finally, after reading the docs from AWS:
The target receives the request and starts to process it, but closes the connection to the load balancer too early. This usually occurs when the duration of the keep-alive timeout for the target is shorter than the idle timeout value of the load balancer.
https://aws.amazon.com/premiumsupport/knowledge-center/elb-alb-troubleshoot-502-errors/
SOLUTION:
I was running "Apache" webserver in EC2 so Increased "KEEPALIVETIMEOUT=65". This did the trick. For me.

Socket.io - invalid HTTP status code on SOME browsers

Just after a few weeks of working fine, our Socket.io started spewing errors on some browsers. I've tried updating to the latest Socket.io version, I've tried our setup on different machines, all sorts of them; it seems to work on most browsers, with no clear pattern of which ones work.
These errors appear at one-second intervals:
OPTIONS https://website.com/socket.io/?EIO=2&transport=polling&t=1409760272713-52&sid=Dkp1cq0lpKV75IO8AdA3 socket.io-1.0.6.js:2
XMLHttpRequest cannot load https://website.com/socket.io/?EIO=2&transport=polling&t=1409760272713-52&sid=Dkp1cq0lpKV75IO8AdA3. Invalid HTTP status code 400
We're behind Amazon's ELB, with Socket.io on polling because the ELB doesn't support WebSockets.
I found the problem that has been causing this, and it's really unexpected...
This problem comes from using load-balanced services like AWS ELB (an independent EC2 instance should be fine, though) and Heroku; their infrastructure doesn't fully support Socket.io. AWS ELB flat out won't support WebSockets, and Heroku's router is trash for Socket.io, even in conjunction with socket.io-redis.
The problem is hidden when you use a single server, but as soon as you start clustering, you will get issues. A single Heroku dyno on my application worked fine; the problems only started appearing in production, not in development where we weren't using more than one server. We tried ELB with sticky load balancing, and even then we still had the same issues.
When Socket.io returns 400 errors in this case, it is effectively saying "this session doesn't exist and you never completed the handshake", because you completed the handshake on a different server in your cluster.
The solution for me was just dedicating an EC2 instance for my web app to handle Socket.io.
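For context, the cross-node wiring mentioned above looks roughly like this. It is a sketch using the current @socket.io/redis-adapter package (the successor of the socket.io-redis package named in the answer) with a local Redis assumed; the adapter only relays events between nodes, so each client's polling requests still have to be pinned to one node with sticky sessions, otherwise the handshake 400s remain:

import { createServer } from 'http';
import { Server } from 'socket.io';
import { createAdapter } from '@socket.io/redis-adapter';
import { createClient } from 'redis';

async function main() {
  // assumption: a Redis instance reachable at localhost:6379
  const pubClient = createClient({ url: 'redis://localhost:6379' });
  const subClient = pubClient.duplicate();
  await Promise.all([pubClient.connect(), subClient.connect()]);

  const httpServer = createServer();
  const io = new Server(httpServer);

  // relay broadcasts across every node in the cluster through Redis pub/sub
  io.adapter(createAdapter(pubClient, subClient));

  httpServer.listen(3000);
}

main().catch((err) => console.error(err));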

HTTP 502 Server Hangup. What are the causes?

We are querying a Node.js-based server in our company, and one of the servers every so often throws an HTTP status 502 after exactly 2 minutes, saying "Server Hangup". I saw a lot of questions asked on Stack Overflow, but I couldn't find a definitive answer. The way to reproduce this issue is to send an HTTP POST request to the server; GET requests are fine.
I have tried the exact same requests against other similar servers and have never gotten a 502. I have read everything about what 502 means, but I am not sure what could be wrong with the server. Maybe a tcpdump on that server would be helpful? Could it be that the server has too many connections and is not freeing them up? I would like to get some context before I email the other team about the problem.
Any feedback is appreciated.
Thanks,
KA
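One detail worth checking (an inference from the "exactly 2 minutes", not something confirmed in this thread): older Node releases ship http.Server with a default socket inactivity timeout of 120000 ms, so a handler that takes too long to respond to a POST has its socket destroyed at exactly the 2-minute mark, and the proxy in front reports that as a 502. A minimal sketch, with a hypothetical server setup, of inspecting and raising that timeout:

// sketch: server.timeout defaults to 120000 ms (2 minutes) on older Node releases
import * as http from 'http';

const server = http.createServer((req, res) => {
  // ... application logic ...
  res.end('ok');
});

server.listen(3000, () => {
  console.log('socket inactivity timeout (ms):', server.timeout);
});

// raise the inactivity timeout (or disable it with 0) if slow POST handling is expected
server.setTimeout(5 * 60 * 1000);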
