504 error on nodejs sails API server randomly across all endpoints - node.js

We run Sails.js as an API server and intermittently get 504 "upstream timed out" errors. The request gets as far as Nginx, which then reports an upstream timeout; nothing is logged on the application server. It happens across all APIs, intermittently rather than all the time. Traffic on the server is very low (5-10 requests per minute) and there is no heavy computation going on, so I am not sure how to debug this.
There are no server restarts or other errors in the application logs, and the app runs fine under PM2 on an AWS EC2 instance. The current timeout is 60 seconds, but none of our APIs takes more than 500 milliseconds. We are on Node 6.6 because this is a legacy monolith that we cannot upgrade: it has many dependencies and no single owner.
Requests pass through the load balancer to Nginx but sometimes do not reach the application server. The instance is generously sized in CPU and memory, and traffic is extremely low. The behavior is random and not specific to any API: sometimes 1 request in 10 fails with a gateway timeout, sometimes 2 in 5.
Some of the Nginx logs are below:
[error] 31688#31688: *38998779 connect() failed (110: Connection timed out) while connecting to upstream, client 10.X.X.X
2022/04/22 16:36:37 [error] 31690#31690: *38998991 connect() failed (110: Connection timed out) while connecting to upstream, client: <server_ip>, server: <DNS_URL>
I have tried almost everything I could find on Stack Overflow, but nothing has helped. Please help me find the root cause so I can mitigate the issue.
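One way to check whether the failures cluster in time is to tally the upstream connect errors from the Nginx error log. On Linux, errno 110 is ETIMEDOUT, so these lines mean the TCP connection from Nginx to the upstream was never established, which would be consistent with nothing appearing in the application logs. The sketch below parses lines shaped like the timestamped one above; the function names are hypothetical, and lines without a timestamp prefix are simply skipped.

```javascript
// Matches the standard prefix of an Nginx error-log line:
// "YYYY/MM/DD HH:MM:SS [level] pid#tid: *connection message"
const LINE_RE = /^(\d{4}\/\d{2}\/\d{2} \d{2}:\d{2}:\d{2}) \[(\w+)\] (\d+)#(\d+): \*(\d+) (.*)$/;
// Matches the upstream connect failure shown in the logs above.
const CONNECT_RE = /connect\(\) failed \((\d+): ([^)]+)\) while connecting to upstream/;

// Hypothetical helper: returns details for an upstream connect failure, else null.
function parseUpstreamConnectFailure(line) {
  const prefix = LINE_RE.exec(line);
  if (!prefix) return null;
  const failure = CONNECT_RE.exec(prefix[6]);
  if (!failure) return null;
  return {
    timestamp: prefix[1],
    connection: Number(prefix[5]),
    errno: Number(failure[1]),
    reason: failure[2],
  };
}

// Hypothetical helper: counts failures per minute to show whether they cluster.
function countPerMinute(lines) {
  const counts = {};
  for (const line of lines) {
    const hit = parseUpstreamConnectFailure(line);
    if (!hit) continue;
    const minute = hit.timestamp.slice(0, 16); // "YYYY/MM/DD HH:MM"
    counts[minute] = (counts[minute] || 0) + 1;
  }
  return counts;
}

const sample =
  '2022/04/22 16:36:37 [error] 31690#31690: *38998991 connect() failed ' +
  '(110: Connection timed out) while connecting to upstream, ' +
  'client: <server_ip>, server: <DNS_URL>';
console.log(countPerMinute([sample]));
```

Feeding it the lines of the error log (for example, the file split on newlines) gives a per-minute count that can be compared against deploys, cron jobs, or load balancer health-check intervals.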

Related

NestJS on EC2 keeps responding with 502 randomly

I built a server with NestJS and run it on an EC2 instance, and it responds with 502 randomly.
I googled it, and a lot of articles mention a mismatch between the timeout intervals of the ALB and the server.
When I made a simple web page server with Express, the same issue happened, but I fixed it by adjusting keepAliveTimeout and headersTimeout
(to 65000 ms and 67000 ms respectively).
However, this NestJS server still responds with 502 and I don't know what else to do.
I tried changing the ALB timeout to 30 seconds and to 120 seconds, and that didn't work either.
What else could be causing this? I am lost.
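For reference, the keepAliveTimeout/headersTimeout adjustment described above can be sketched on a plain Node HTTP server as follows. The helper name and the 5 s / 2 s margins are assumptions chosen to reproduce the 65000 ms and 67000 ms values mentioned; the idea is only that the server's keep-alive window should be longer than the load balancer's idle timeout, and headersTimeout longer still. Note that these properties exist only on newer Node versions (keepAliveTimeout since 8.0, headersTimeout since 11.3).

```javascript
const http = require('http');

// Hypothetical helper: keep the Node keep-alive window longer than the
// load balancer's idle timeout, so the load balancer is the side that
// closes idle connections rather than the target.
function applyKeepAliveTimeouts(server, lbIdleTimeoutMs) {
  server.keepAliveTimeout = lbIdleTimeoutMs + 5000; // assumed 5 s margin
  server.headersTimeout = server.keepAliveTimeout + 2000; // assumed 2 s margin
  return server;
}

// With a 60 s load balancer idle timeout this yields 65000 and 67000.
const server = applyKeepAliveTimeouts(
  http.createServer((req, res) => res.end('ok')),
  60000
);

console.log(server.keepAliveTimeout, server.headersTimeout);
// server.listen(3000); // port is an assumption; left commented out
```

In a framework such as NestJS or Express the same properties would have to be set on the underlying http.Server instance, not on the framework's app object.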

AWS Load Balancer 502 Bad Gateway

I have microservices written in node/express hosted on EC2 with an application load balancer.
Some users are getting a 502 even before the request reaches the server.
I log everything inside each instance, and there are no logs for those requests. I have the request immediately before the 502 and the requests right after it, which is why I assume the request never reaches the servers. Most users solve this by refreshing the page or using an incognito tab, which makes the connection go to a different machine (we have 6).
I can tell from the load balancer logs that the load balancer responds to the request with 502 almost immediately. I guess that this could be a TCP RST.
I had a similar problem a long time ago, and I had to add keepAliveTimeout and headersTimeout to the node configuration. Here are my settings (still using the LB default of 60 s):
server.keepAliveTimeout = 65000;
server.headersTimeout = 80000;
The metrics, especially memory and CPU usage of all instances are fine.
These 502 errors started after an update in which we introduced several packages, for instance axios. At first, I thought it could be axios, because keep-alive is not enabled by default there. But that didn't fix it. Other than axios, we just use request.
Any tips on how I should debug/fix this issue?
HTTP 502 errors are usually caused by a problem with the load balancer, which would explain why the requests never reach your server, presumably because the load balancer can't reach the server for some reason.
This link has some hints regarding how to get logs from a classic load balancer. However, since you didn't specify, you might be using an application load balancer, in which case this link might be more useful.
From the ALB access logs I knew that either the ALB couldn't connect to the target or the connection was being immediately terminated by the target.
The most difficult part was figuring out how to replicate the 502 error.
It looks like the node version I was using has a request header size limit of 8 KB. If any request exceeded that limit, the target would reject the connection, and the ALB would return a 502 error.
Solution:
I solved the issue by adding --max-http-header-size=size to the node start command line, where size is a value greater than 8kb.
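To find out which requests might be hitting the limit described above, a rough estimate of the header block size can help. The sketch below is an approximation under stated assumptions: it counts each header as "Name: value\r\n", ignores the request line, and uses the 8 KB figure from this answer as the default limit. The function names and sample headers are hypothetical.

```javascript
// Rough size of a header block as sent on the wire: "Name: value\r\n" per header.
// This is an estimate; it does not include the request line.
function estimateHeaderBytes(headers) {
  return Object.entries(headers).reduce(
    (total, [name, value]) => total + Buffer.byteLength(`${name}: ${value}\r\n`),
    0
  );
}

const LIMIT_BYTES = 8 * 1024; // the 8 KB limit described above

// Hypothetical helper: true when the estimated header size exceeds the limit.
function exceedsHeaderLimit(headers, limit = LIMIT_BYTES) {
  return estimateHeaderBytes(headers) > limit;
}

// Hypothetical sample headers: a small request and one with an oversized cookie.
const small = { host: 'example.test', accept: '*/*' };
const large = { host: 'example.test', cookie: 'a'.repeat(9000) };

console.log(exceedsHeaderLimit(small), exceedsHeaderLimit(large));
```

Large cookies or long authorization tokens are the usual candidates for headers of this size, so checking a failing user's request headers from the client side is one place to start.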
A few common reasons for an AWS Load Balancer 502 Bad Gateway:
Make sure the public subnets that your ALB is targeting are set to auto-assign a public IP (so that deployed instances are auto-assigned a public IP).
Make sure the security group for your ALB allows HTTP and/or HTTPS traffic from the IPs that you are connecting from.
I also had the same problem for one or two months and couldn't find a solution. I had AWS Premium Support as well, but they were not able to find one either. I was getting 502 errors randomly, maybe 10 times per day. Finally, after reading the AWS docs, I found this:
The target receives the request and starts to process it, but closes the connection to the load balancer too early. This usually occurs when the duration of the keep-alive timeout for the target is shorter than the idle timeout value of the load balancer.
https://aws.amazon.com/premiumsupport/knowledge-center/elb-alb-troubleshoot-502-errors/
SOLUTION:
I was running the Apache web server on EC2, so I increased KeepAliveTimeout to 65. That did the trick for me.

Is there any way to increase the timeout connection between Apache web server and tomcat

My application times out when I try to fetch data from sources available on my Tomcat server. I can see that the DB query is the culprit: it takes 100 seconds to return data because of the huge amount of data it processes. My request times out after 60 seconds, which leads to the error below:
Proxy Error
The proxy server received an invalid response from an upstream server.
The proxy server could not handle the request
Reason: Error reading from remote server
I am using mod_proxy to connect to the Tomcat servers from the Apache web server. I tried increasing connectionTimeout to 90000 milliseconds for the SSL connector, but the request still times out after 60 seconds. Is there anything else I need to change to increase the connection timeout?
I am using Tomcat 9. Any leads would be very helpful, as I have been stuck on this for quite a while.
Thanks
I found a way to define the proxy timeout value in the httpd.conf file. I added the two parameters below:
Timeout 300
ProxyTimeout 300
For more details you can refer here

CouchDB Ubuntu 14 Azure Socket Connection Timeout error

I have couchdb running on a Linux Ubuntu 14.04 VM and a .net Web application running under Azure Web Apps. Under our ELMAH logging for the web application I keep getting intermittent errors:
System.Net.Sockets.SocketException
A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond [ipaddress]:5984
I've checked the CouchDB logs and there is no record of those requests, so I don't believe they are reaching the CouchDB server. I can confirm this from the web server logs on Azure, which show the Error 500 response. I've also tried a tcpdump, with little success (a separate issue: logging tcpdump to a separate disk keeps failing with access denied).
We previously ran CouchDB on a Windows VM with no issues, so I wonder whether the issue relates to the OS connection settings for TCP and timeouts.
Anyone have any suggestions as to where to look or what immediately jumps to mind?

Node.js intermittent connection refused in development

When I run my node.js app in development I intermittently see connection refused on about every 2nd or 3rd request. I am not even sending the requests very quickly (about 1 per second). The requests should complete very quickly, as this is an Express app with an endpoint that just checks whether the content-type is set correctly. Is it likely that I am seeing the issue because I am not proxying the requests through nginx? Nginx would queue the requests, whereas without nginx I am hitting my node.js app directly. I don't see anything in my node.js app's logs that would indicate an error.
