Kubernetes drops HTTP connections initiated by node.js / postgres - node.js

I have a very simple piece of code written in node.js which runs on Kubernetes and AWS. The app just does POST/GET requests to create and get data from other services: service1 --> service2 --> service3.
Service1 gets the POST request and calls service2; service2 calls the Postgres DB (using Sequelize), creates a new row and then calls service3; service3 gets data from the DB and returns the response to service2; service2 returns the response to service1.
Most of the time it works, but about once in 4-5 attempts, and under concurrency, the request is dropped and I get a timeout. The strange part is that service1 receives the response back (according to the logs and network traces), yet the connection seems to have been dropped somewhere between the services and I get a timeout (ESOCKETTIMEDOUT).
I've tried replacing request.js with node-fetch.
I've tried using New Relic / Elastic APM.
I've tried node --prof and analyzed it with node --prof-process, with no conclusions.
Is it possible Kubernetes drops my connection?

Hard to tell without debugging, but since some connections are getting dropped when you add more load and concurrency, it's likely that you need more replicas in your Kubernetes Deployments, and possibly to adjust the resource requests and limits in your container pod specs.
If this turns out to be the case you can also configure an HPA (Horizontal Pod Autoscaler) to handle your load.
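If it does turn out to be a capacity problem, a minimal HPA sketch could look like the following (the deployment name, replica counts and target utilisation are placeholders to tune for your workload):

# Hypothetical HPA: scales the placeholder "service2" Deployment on CPU usage.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: service2-hpa            # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: service2              # placeholder deployment
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70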

Related

How does Application Gateway prevent requests being sent to recently terminated pods?

I'm currently researching and experimenting with Kubernetes in Azure. I'm playing with AKS and the Application Gateway ingress. As I understand it, when a pod is added to a service, the endpoints are updated and the ingress controller continuously polls this information. As new endpoints are added AG is updated. As they're removed AG is also updated.
As pods are added there will be a small delay whilst that pod is added to the AG before it receives requests. However, when pods are removed, does that delay in update result in requests being forwarded to a pod that no longer exists?
If not, how does AG/K8S guarantee this? What behaviour could the end client potentially experience in this scenario?
Azure Application Gateway ingress is an ingress controller for your Kubernetes deployment which allows you to use the native Azure Application Gateway to expose your application to the internet. Its purpose is to route traffic directly to pods, while pod availability, scheduling and management in general remain the responsibility of Kubernetes itself.
When a pod receives a command to terminate, it does not disappear instantly. Shortly afterwards, kube-proxy updates iptables to stop directing new traffic to the pod, but ingress controllers or load balancers may still be forwarding connections directly to the pod (which is the case with an application gateway). It's impossible to solve this issue completely, but adding a 5-10 second delay before shutdown can significantly improve the user experience.
If you need to terminate or scale down your application, you should consider the following steps (a minimal Node.js sketch follows the list):
Wait for a few seconds and then stop accepting connections
Close all keep-alive connections that are not in the middle of a request
Wait for all active requests to finish
Shut down the application completely
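A minimal Node.js sketch of that sequence, assuming a plain Express/http server (endpoint names and delays are illustrative):

const express = require('express');

const app = express();
app.get('/healthz', (req, res) => res.sendStatus(200));   // assumed health endpoint

const server = app.listen(8080);

process.on('SIGTERM', () => {
  // 1. Keep serving for a few seconds so endpoints/iptables and the
  //    ingress stop sending new traffic first.
  setTimeout(() => {
    // 2.-3. Stop accepting new connections and wait for in-flight requests
    //       to finish; on newer Node versions close() also closes idle
    //       keep-alive connections (older versions can call
    //       server.closeIdleConnections() where available).
    server.close(() => {
      // 4. Shut down completely once everything has drained.
      process.exit(0);
    });
  }, 5000);
});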
Here are exact kubernetes mechanics which will help you to resolve your questions:
preStop hook - this hook is called immediately before a container is terminated. It is very helpful for graceful shutdowns of an application. For example, a simple "sleep 5" sh command in a preStop hook can prevent users from seeing "Connection refused" errors. After the pod receives an API request to be terminated, it takes some time to update iptables and let the application gateway know that this pod is out of service. Since the preStop hook is executed before the SIGTERM signal, it helps bridge that window.
(an example can be found in the Kubernetes documentation on attaching handlers to container lifecycle events)
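For illustration, a preStop sleep in a pod spec might look like this (the pod name, container name and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: service2-example               # placeholder name
spec:
  containers:
    - name: app
      image: myregistry/service2:latest   # placeholder image
      lifecycle:
        preStop:
          exec:
            # Keep serving for a few seconds while endpoints/iptables and
            # the ingress are updated; SIGTERM is sent only afterwards.
            command: ["sh", "-c", "sleep 5"]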
readiness probe - this type of probe runs on the container for its whole lifetime and defines whether the pod is ready to accept and serve requests. When the container's readiness probe returns success, the container can handle requests and the pod is added to the endpoints. If a readiness probe fails, the pod is not able to handle requests and is removed from the Endpoints object. This works very well for newly created pods when an application takes some time to load, as well as for already running pods when an application temporarily cannot process requests.
Before a pod is removed from the endpoints, the readiness probe has to fail several times. It's possible to lower this to a single failure using the failureThreshold field, but at least one failed check still has to be detected.
(additional information on how to set it up can be found in the Kubernetes documentation on configuring liveness, readiness and startup probes)
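For illustration, a readinessProbe fragment that would sit under the container spec above (the /healthz path and port 8080 are assumptions about the app):

      readinessProbe:
        httpGet:
          path: /healthz        # assumed health endpoint
          port: 8080            # assumed container port
        initialDelaySeconds: 5
        periodSeconds: 10
        failureThreshold: 1     # drop from endpoints after a single failed check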
startup probe - for applications which require additional time for their first initialisation, it can be tricky to set readiness probe parameters correctly without compromising the fast response the probe should give afterwards.
Using the failureThreshold * periodSeconds fields provides this flexibility.
terminationGracePeriodSeconds - may also be worth adjusting if an application requires more than the default 30 seconds to shut down gracefully (this is especially important for stateful applications)
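A sketch combining both points, with illustrative values: failureThreshold: 30 and periodSeconds: 10 give the (placeholder) app up to 30 * 10 = 300 seconds to start, and terminationGracePeriodSeconds raises the shutdown grace period above the default 30 seconds:

spec:
  terminationGracePeriodSeconds: 60    # longer than the default 30 s
  containers:
    - name: app                        # placeholder container
      image: myregistry/service2:latest   # placeholder image
      startupProbe:
        httpGet:
          path: /healthz               # assumed health endpoint
          port: 8080
        failureThreshold: 30
        periodSeconds: 10              # up to 300 s for first initialisation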

AWS EBS runs into "504 Gateway Time-out"

I'm new to using AWS EBS and ECS, so please bear with me if I ask questions that might be obvious for others. To the issue:
I've got a single-container Node/Express application that runs on EBS. The local Docker container works as expected. On EBS, I can access one endpoint of the API and get the expected output. For the second endpoint, which runs longer (around 10-15 seconds), I get no response and after 60 seconds run into a timeout: "504 Gateway Time-out".
I wonder how I should approach debugging this, as I can't connect to the container directly. Currently there isn't any debugging functionality included in the code either, as I'm not sure what the best Node approach for an EBS container is - any recommendations are highly appreciated.
Thank you in advance!
You can see the EC2 instances running on EBS in your AWS console, and you can choose to give them IP addresses in your EBS options. That will let you SSH directly into them if you need to.
Otherwise, check the keepAliveTimeout field on your server (the value returned by app.listen() if you're using Express).
I got a decent number of 504s when my Node server timeout was less than my load balancer timeout.
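As an illustration of that relationship, with an Express app behind a load balancer whose idle timeout is 60 seconds, one common mitigation is to keep Node's keep-alive timeout above the load balancer's (the exact numbers here are placeholders):

const express = require('express');
const app = express();

// ... routes ...

const server = app.listen(process.env.PORT || 3000);

// Keep Node's idle keep-alive timeout above the load balancer's 60 s idle
// timeout so the load balancer, not Node, closes idle connections.
server.keepAliveTimeout = 65 * 1000;
// headersTimeout should be larger than keepAliveTimeout.
server.headersTimeout = 66 * 1000;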
Your application takes longer than expected (> 60 seconds) to respond, so either nginx or the Load Balancer terminates your request.
See my answer here

AppEngine nodejs app sporadically sends 502s and restarts

We have a nodejs app that gets successfully deployed to a standard environment. Something happens after about two hours (or sooner depending on traffic): our downstream clients start receiving a bunch of 502 responses and then the service stabilizes. We think this has been happening for at least a few months.
When investigating the cause of the 502s, I see that:
There are no unhandled exception/promise rejection logs to indicate that the node app has crashed
I console.log when receiving SIGTERM and that, too, does not appear in the logs
The logs of the nginx sidecar include the following:
2020/06/16 23:11:11 [error] 35#35: *1149 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 169.254.1.1, server: _, request: "POST /api/redacted HTTP/1.1", upstream: "http://127.0.0.1:8081/api/redacted", host: "redacted.appspot.com""
I'm assuming that the 502s are coming from nginx because the upstream has disappeared. Are there other explanations I should explore?
If GAE is replacing my app containers intentionally, shouldn't that process prevent these types of 502s?
Should I expect something other than SIGTERM to be sent by the environment when the application/container is getting replaced?
Update #1 (2020-06-22)
I investigated and found evidence that we might be exceeding the memory quota, so I changed our instance_class from F1 to F2. As I write this our instances are sitting at ~200M of memory usage (F2s have 512M available). Additionally, I use the --max-old-space-size switch to set node's memory usage to 496M.
The 502s are still happening.
I suspect that the 502s are happening as a result of the autoscaler terminating instances. Our app never receives SIGTERM (even during deployments). That means I can't close http keepalive connections gracefully and might explain why nginx raises Connection reset by peer.
Update #2 (2020-06-24)
Our service is just standard REST type stuff, no heavy loops.
I'll post another update with some memory graphs but I don't see any spikes. Perhaps a small memory leak.
Here's our app.yaml:
service: redacted
runtime: nodejs12
instance_class: F2
handlers:
- url: /.*
  secure: always
  redirect_http_response_code: 301
  script: auto
We had a very similar problem with our Node.js app deployed on App Engine Flexible.
In our case, we ultimately determined that we had memory pressure that was causing the Node.js garbage collector to sometimes delay the processing of a request for hundreds of milliseconds (sometimes more). This caused our health check URLs to sporadically timeout, prompting GAE to remove the instance from the active pool.
Because we typically had just two instances handling the steady traffic, removing one instance quickly overloaded the remaining instance, and it would soon suffer the same fate.
We were surprised to find that it could take two minutes or longer before App Engine assigned traffic to a newly-created instance. Between the time our original instances were declared unhealthy, and when new instance(s) were online, 502s would be returned (presumably by GAE's nginx) to the client.
We were able to stabilize the environment simply by adding:
automatic_scaling:
  min_num_instances: 4
To our app.yaml. Because two instances were generally sufficient for the traffic, ensuring we always had four running apparently kept our memory usage low enough to prevent the GC from stalling request handling, and even if it did, we had enough excess capacity to handle one instance being removed.
The scaling settings for GAE standard are slightly different.
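For the standard environment, as far as I know the corresponding setting is min_instances under automatic_scaling, for example:

automatic_scaling:
  min_instances: 4      # standard-environment counterpart of min_num_instances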
In retrospect, we could see that our latency/response times would get a little "jittery" before the real problems started. Most responses had typical response times ~30ms, but increasingly we would see outlier requests in the x00ms range. You may want to check your request logs to see if you see something similar.
New Relic's Node.js VM data was helpful in detecting that garbage collection was taking an increasing amount of time.
Usually, 502 messages are errors on the nginx side, as you have mentioned. The detailed logs related to these errors are not surfaced to Cloud Logging yet.
Based on the behavior you describe, this looks like a workload issue, so the case can be related to running out of resources.
There are some things well worth taking a look at:
Check your metrics. The memory and CPU usage should be under healthy limits.
Check whether your scaling settings are sufficient for your workload.
Is there a chance to share these metrics from around the restart event?
Also, it would be good if you shared the resources and scaling settings from your app.yaml.

Azure Http connection gets interrupted after 5 minutes

We have a setup with several RESTful APIs on the same VM in Azure.
The websites run in Kestrel on IIS.
They are protected by the azure application gateway with firewall.
We now have requests that would run for at least 20 minutes.
The requests run the full length uninterrupted in Kestrel (visible in the logs), but the sender either gets "socket hang up" after exactly 5 minutes or waits forever even though the request has finished in Kestrel. The request continues in Kestrel even if the connection was interrupted for the sender.
What I have done:
Wrote a small example application that returns after a set number of seconds, to rule out our websites being the problem.
Ran the request inside the VM (to localhost): no problems, response was received.
Ran the request within Azure from one VM to another: request ran forever.
Ran the request from outside of Azure: request terminates after 5 minutes with "socket hang up".
Checked the configured timeouts: Kestrel: 50 min, IIS: 4000 s, Application Gateway HTTP settings: 3600 s.
Requests were tested with Postman.
Is there another request or connection timeout hidden somewhere in Azure?
We now have requests that would run for at least 20 minutes.
This is a horrible architecture and it should be rewritten to be async. Don't take this personally, it is what it is. Consider returning a 202 Accepted with a Location header to poll for the result.
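A rough Express sketch of that pattern; the in-memory job store, route names and runLongReport worker are made up for illustration:

const express = require('express');
const crypto = require('crypto');

const app = express();
app.use(express.json());

const jobs = new Map();   // in-memory store; use a real queue/DB in practice

// Placeholder for the 20-minute piece of work.
async function runLongReport(input) {
  /* ... long-running processing ... */
  return { ok: true };
}

app.post('/reports', (req, res) => {
  const id = crypto.randomUUID();
  jobs.set(id, { status: 'pending' });

  // Kick off the work without holding the HTTP connection open.
  runLongReport(req.body)
    .then(result => jobs.set(id, { status: 'done', result }))
    .catch(err => jobs.set(id, { status: 'failed', error: String(err) }));

  // 202 Accepted plus a Location header the client can poll.
  res.status(202).location(`/reports/${id}`).end();
});

app.get('/reports/:id', (req, res) => {
  const job = jobs.get(req.params.id);
  if (!job) return res.sendStatus(404);
  res.json(job);
});

app.listen(3000);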
You're most probably hitting the Azure SNAT layer timeout; change it under the Configuration blade for the Public IP.
So I ran into something like this a little while back:
For us the issue was probably the timeout, like the other answer suggests, but the solution was (instead of increasing the timeout) to add PgBouncer in front of our Postgres database to manage the connections and make sure a new one is started before the timeout fires.
Not sure what your backend connection looks like but something similar (backend db proxy) could work to give you more ability to tune connection / reconnection on your side.
In our case we were running AKS (Azure Kubernetes Service), but all Azure public IPs obey the same rules that cause issues like this one.
While it isn't a complete answer, note that there are also two types of public IP addresses; the "Basic" one doesn't have the same configurability, so the difference between Basic and Standard public IPs / load balancers could also be a factor.

RabbitMQ cluster fails when one node is not reachable

I created a RabbitMQ cluster via Docker and Docker Cloud. I am running two RabbitMQ containers on two separate nodes (both hosted on AWS).
The output of rabbitmqctl cluster_status is:
Cluster status of node 'rabbit@rabbitmq-cluster-2' ...
[{nodes,[{disc,['rabbit@rabbitmq-cluster-1','rabbit@rabbitmq-cluster-2']}]},
 {running_nodes,['rabbit@rabbitmq-cluster-1','rabbit@rabbitmq-cluster-2']},
 {cluster_name,<<"rabbit@rabbitmq-cluster-1">>},
 {partitions,[]}]
However, when I stop one container/node, my messages cannot get delivered and are queued in .dlx.
I am using senecajs with NodeJS.
Did anybody have the same problems and can point me into a direction?
To answer my own question:
The problem was that Docker caches the DNS after starting and is not able to connect to a new address. So if one cluster node fails, Docker still tries to connect to that one instead of trying a different one.
The solution was to write my own function for connecting to RabbitMQ: I first check with net.createConnection whether the host is online. If yes, I connect to it; if not, I try a different one.
Every time a RabbitMQ node is down, my service fails, restarts and calls the "try this host" function.
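A sketch of that reachability check (the host list, port and timeout are illustrative):

const net = require('net');

// Probe a host with a raw TCP connection to see whether it is reachable.
function isHostOnline(host, port, timeoutMs = 2000) {
  return new Promise(resolve => {
    const socket = net.createConnection({ host, port });
    socket.setTimeout(timeoutMs);
    socket.once('connect', () => { socket.destroy(); resolve(true); });
    socket.once('timeout', () => { socket.destroy(); resolve(false); });
    socket.once('error', () => resolve(false));
  });
}

// Return the first RabbitMQ host that answers, so the client never tries
// to connect to a node that is down.
async function pickRabbitHost(hosts, port = 5672) {
  for (const host of hosts) {
    if (await isHostOnline(host, port)) return host;
  }
  throw new Error('No RabbitMQ host reachable');
}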
