My Node.js application is running on a production server via the forever daemon:
forever start -w --watchDirectory=/path/to/app \
--watchIgnore=/path/to/app/node_modules/** /path/to/app/server.js
When I change file contents in the /path/to/app/ directory, forever restarts the process. The restart takes around 2-3 seconds, during which the app is unavailable, so downtime occurs every time I deploy a change. How can I prevent this downtime, assuming I have full access to the server?
You can do this manually with an HTTP load balancer: create two or more backends that are reachable only by the load balancer (the load balancer is the only component with a public address). Then update one backend at a time: while the load balancer directs traffic to the available backend, update the other one. After the successful update, bring the updated backend back up and point the load balancer at it, then repeat the procedure for the remaining backend. Both end up updated without service downtime.
Related
I have microservices written in node/express hosted on EC2 with an application load balancer.
Some users are getting a 502 even before the request reaches the server.
I log every request inside each instance, and I don't have logs for those failing requests: I have the request immediately before the 502 and the requests right after it, which is why I assume the failing request never reaches the servers. Most users work around it by refreshing the page or using an incognito tab, which connects them to a different machine (we have 6).
I can tell from the load balancer logs that the load balancer responds to the request with a 502 almost immediately. I guess this could be a TCP RST.
I had a similar problem a long time ago and had to add keepAliveTimeout and headersTimeout to the Node configuration. Here are my settings (still using the load balancer's default idle timeout of 60 seconds):
server.keepAliveTimeout = 65000;
server.headersTimeout = 80000;
The metrics, especially memory and CPU usage, of all instances are fine.
These 502 errors started after an update in which we introduced several packages, for instance axios. At first I thought it could be axios, because keep-alive is not enabled by default there, but enabling it didn't help. Other than axios, we only use request.
Any tips on how should I debug/fix this issue?
HTTP 502 errors are usually caused by a problem at the load balancer, which would explain why the requests never reach your server: presumably the load balancer can't reach the server for some reason.
This link has some hints regarding how to get logs from a classic load balancer. However, since you didn't specify, you might be using an application load balancer, in which case this link might be more useful.
From the ALB access logs I knew that either the ALB couldn't connect to the target or the connection was being terminated immediately by the target.
The most difficult part was figuring out how to replicate the 502 error.
It turned out that the Node version I was using has a request-header size limit of 8 KB. If any request exceeded that limit, the target rejected the connection and the ALB returned a 502 error.
Solution:
I solved the issue by adding --max-http-header-size=size to the node start command line, where size is a value (in bytes) greater than 8 KB.
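For example (the 16384-byte value below is just an illustration; pick any limit large enough for your headers), the effective limit is visible at runtime via http.maxHeaderSize, which lets you confirm the flag took effect:

```shell
# Raise the request-header limit from the 8 KB default of affected Node versions.
# http.maxHeaderSize reports the effective limit, confirming the flag was applied.
node --max-http-header-size=16384 -e 'console.log(require("http").maxHeaderSize)'
```

In production you would pass the same flag when starting your server, e.g. `node --max-http-header-size=16384 server.js`.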
A few common reasons for an AWS Load Balancer 502 Bad Gateway:
Make sure the public subnets your ALB targets are set to auto-assign a public IP (so that instances deployed into them are automatically assigned one).
Make sure the security group for your ALB allows HTTP and/or HTTPS traffic from the IPs you are connecting from.
I had the same problem for a month or two and couldn't find the solution; I have AWS Premium Support, and they couldn't find it either. I was getting the 502 error randomly, maybe 10 times per day. I finally found the answer in the AWS docs:
The target receives the request and starts to process it, but closes the connection to the load balancer too early. This usually occurs when the duration of the keep-alive timeout for the target is shorter than the idle timeout value of the load balancer.
https://aws.amazon.com/premiumsupport/knowledge-center/elb-alb-troubleshoot-502-errors/
SOLUTION:
I was running the Apache web server on EC2, so I increased KeepAliveTimeout to 65. This did the trick for me.
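For reference, the relevant Apache directives look like this (the file location varies by distribution; 65 is chosen only because it exceeds the ALB's 60-second idle timeout):

```apache
# Keep idle connections to the target open longer than the ALB's
# 60-second idle timeout, so the target never closes the connection first.
KeepAlive On
KeepAliveTimeout 65
```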
We have a weird networking issue.
We have a Hyperledger Fabric client application written in Node.js running in Kubernetes which communicates with an external Hyperledger Fabric Network.
We randomly get timeout errors on this communication. When the pod is restarted, everything is fine for a while, then the timeout errors start; sometimes the problem fixes itself, only to come back later.
This is on Azure AKS. We also set up a quick Kubernetes cluster on AWS with Rancher, deployed the app there, and hit the same timeout errors.
We ran scripts in the same container all night long, hitting the external Hyperledger endpoint every minute with both cURL and a small Node.js script, and we didn't get a single error.
We ran the application in another VM as plain Docker containers and there was no issue there.
We inspected the network traffic inside the container. When the issue happens, netstat shows an established connection, but tcpdump shows no traffic; no packets are even attempted.
Checking the Hyperledger Fabric SDK code, it uses gRPC (with protocol buffers) behind the scenes.
So any clues maybe?
This turned out to be not a Kubernetes issue but a dropped-connection issue.
gRPC keeps the connection open, and after some period of inactivity an intermediary component drops it. In the Azure AKS case this is the load balancer, since every outbound connection goes through one; it has a non-configurable idle timeout of 4 minutes, after which it drops the connection.
The fix is configuring gRPC to send keep-alive messages.
The scripts in the container worked without a problem because they open a new connection every time they run.
The application running as plain Docker containers didn't have this issue because we were hitting its endpoints every minute, so the idle-timeout threshold was never reached. When we switched to hitting the endpoints every 10 minutes, the timeouts started there too.
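For anyone hitting the same thing, the keep-alive fix can be expressed as gRPC channel options. The option names below are the standard ones from @grpc/grpc-js; the interval values are assumptions chosen to stay under the 4-minute idle timeout, and the client class name is hypothetical:

```javascript
// Channel options enabling client-side keep-alive pings so that idle
// gRPC connections are not silently dropped by intermediaries such as
// the Azure load balancer with its 4-minute idle timeout.
const keepAliveOptions = {
  'grpc.keepalive_time_ms': 120000,         // ping every 2 minutes (< 4-minute idle timeout)
  'grpc.keepalive_timeout_ms': 20000,       // fail if no ping ack within 20 seconds
  'grpc.keepalive_permit_without_calls': 1, // ping even when no RPC is in flight
};

// Hypothetical usage: pass the options when constructing the client, e.g.
// const client = new FabricPeerClient(address, credentials, keepAliveOptions);
```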
I have a C# console application / Windows service that uses HttpListener to handle requests; IIS is set up to reverse-proxy to it via ARR.
My problem is that when I update this application there is a short downtime between the old instance being shut down and the new one being ready.
The approach I'm thinking about: add two servers to the server farm via local hostnames with two ports. On update, I'd start the new instance listening on the unused port, stop the old instance from accepting new requests, and shut it down gracefully (i.e. let it finish processing its current requests). Those last two steps would be triggered by the new instance, to ensure it is ready to handle requests first.
Is IIS ARR load balancing smart enough to try the other instance and mark the shut-down one as unavailable without losing any requests, or do I have to add health checks etc.? (Would that again lead to a short downtime period?)
One idea that I believe could work (especially if your IIS is only being used for this purpose) is to leverage the IIS overlapped recycling capabilities that are built-in when you make a configuration change. In this case what you could do is:
start a new instance of your app listening on a different port,
edit the configuration in ARR to point to the new port.
IIS should allow any existing requests running in the application pool within the recycling timeout to drain successfully while new requests will be sent to the new application pool.
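As a sketch of that configuration switch (the server hostnames and ports are hypothetical; the element names follow ARR's applicationHost.config schema), the webFarms section might look like:

```xml
<webFarms>
  <webFarm name="MyAppFarm" enabled="true">
    <!-- old instance: disable once the new one is listening -->
    <server address="app-blue" enabled="false">
      <applicationRequestRouting httpPort="5001" />
    </server>
    <!-- new instance on the previously unused port -->
    <server address="app-green" enabled="true">
      <applicationRequestRouting httpPort="5002" />
    </server>
  </webFarm>
</webFarms>
```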
Maybe if you share a bit more on the configuration you are using in ARR (like a snippet of %windir%\system32\inetsrv\config\applicationHost.config and the webFarms section), we can give more specific advice.
For deployments, I want to be able to quickly move traffic from one front end to another with out adding additional hardware. How can this be done?
I would do this with a load balancer. In the normal state, you let the load balancer decide which front end to use. When you want to deploy to frontend 1, you direct all traffic to frontend 2; once frontend 1 is deployed, do the same in reverse for frontend 2.
You can't do it 'quickly' without a load balancer. The alternative, which is slower, would be to update the site's DNS to point to the other web front end.
HTH
I will be running a dynamic web site, and if the server ever stops responding, I'd like to fail over to a static website that displays a "We are down for maintenance" page. I have been reading and found that switching the DNS dynamically may be an option, but how quickly would that change take place? And would everyone see the change immediately? Are there any better ways to fail over to another server?
DNS has a TTL (time to live) and gets cached until the TTL expires. So a DNS cutover does not happen immediately. Everyone with a cached DNS lookup of your site still uses the old value. You could set an insanely short TTL but this is crappy for performance. DNS is almost certainly not the right way to accomplish what you are doing.
A load balancer can do this kind of immediate switchover. All traffic always hits the load balancer first which under normal circumstances proxies requests along to your main web server(s). In the event of web server crash, you can just have the load balancer direct all web traffic to your failover web server.
pound, perlbal, or another software load balancer could do that, I believe, yes.
Perhaps even Apache rewrite rules could allow this? I'm not sure there's a way to branch when the dynamic server is unavailable, though. Customize Apache's 404 response to your liking?
First of all, it's important to understand which kind of failure you want to fail over from. If it's an app/DB error and the server remains up, you can create a script that runs some checks and fails your website over to a temporary page (by changing the Apache config or .htaccess).
If it's a hardware failure, the DNS solution works, but it's not immediate, so you will lose some user traffic.
The ideal solution is to use a proxy (like HAProxy) that forwards HTTP requests to at least two web servers, automatically detects when one of them fails, and switches over to the working one.
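A minimal HAProxy sketch of that setup (the hostnames, ports, and the /healthz check path are assumptions):

```
backend web
    # Probe the dynamic server; while it passes, all traffic goes to it.
    option httpchk GET /healthz
    server dynamic 10.0.0.10:80 check
    # The static maintenance server only receives traffic when 'dynamic' is down.
    server maintenance 10.0.0.20:80 backup
```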
If you're using Amazon AWS, you can use ELB (Elastic Load Balancer).