I'm running a NestJS web application, implemented with Fastify, on Kubernetes.
I split my application into multiple zones and deploy it to Kubernetes clusters in different physical locations (Cluster A and Cluster B).
Everything goes well, except for Zone X in Cluster A, which carries the highest traffic of all zones.
(Here is a 2-day metrics dashboard for Zone X during normal operation.)
The problem only happens in Zone X in Cluster A and never in any other zone or cluster.
At first, some 499 responses appear in Cluster A's ingress dashboard, and soon the memory of the pods suddenly grows to the memory limit, one pod after another.
It seems that the 499 status is caused by the pods not sending responses back to the client.
At the same time, the other zones in Cluster A work normally.
To avoid affecting users, I switch all network traffic to Cluster B and everything works properly, which rules out dirty data as the cause.
I tried killing and redeploying all pods of Zone X in Cluster A, but when I switch traffic back to Cluster A, the problem occurs again. However, after waiting 2-3 hours and then switching the traffic back, the problem disappears!
Since I don't know how this happens, the only thing I can do is switch traffic and check whether everything is back to normal.
I've looked into multiple variations of node memory issues, but none of them seems to cause this problem. Any ideas or leads on this problem?
| Name | Version |
| --- | --- |
| nestjs | v6.1.1 |
| fastify | v2.11.0 |
| Docker Image | node:12-alpine (v12.18.3) |
| Ingress | v0.30.0 |
| Kubernetes | v1.18.12 |
Related
We are in the process of transitioning from a self-managed kubernetes cluster to Google's GKE Autopilot. All of our incoming API requests are handled by an "API Gateway" server that routes requests to various internal services. When testing this API Gateway on the new GKE Autopilot cluster, we noticed sporadic EAI_AGAIN DNS resolution errors for the internal services.
These errors occur even at low load (50-100 requests per second), but appear to increase when the number of concurrent requests increases. They also occur when rolling out a new image to downstream pods, despite having multiple replicas and a rolling update strategy.
Our API Gateway is written in Node.js. Researching online (1, 2), I found that the issue might be related to one of: (i) overloading the Kubernetes-internal DNS server, (ii) overloading the Node.js event loop, since getaddrinfo is blocking, (iii) an issue with musl in Alpine-based Node.js images, or (iv) a race condition in earlier Linux kernel versions.
All but (iv) can probably be ruled out in our case:
(i) kubedns is at very low CPU usage, and GKE Autopilot implements node-local DNS caching by default.
(ii) Our API Gateway is at low CPU usage. The event loop does appear to lag sporadically by up to 100 ms, but there is no strong correlation with the EAI_AGAIN error rate (a small sketch of this measurement follows this list).
(iii) We are running Debian-based Node.js images.
(iv) I'm not sure what Linux kernel version the GKE Autopilot nodes are running, but at our low load I don't think we should be hitting this error.
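For reference, a minimal sketch of the measurement behind (ii): sample event-loop delay with perf_hooks and count EAI_AGAIN errors from DNS lookups. The service hostname below is a placeholder, not our real configuration.

```ts
// Minimal sketch: correlate event-loop lag with EAI_AGAIN occurrences.
// The hostname below is a placeholder for an internal service name.
import { monitorEventLoopDelay } from "perf_hooks";
import { promises as dns } from "dns";

const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();

let eaiAgainCount = 0;

async function probe(host: string): Promise<void> {
  try {
    await dns.lookup(host);
  } catch (err) {
    if ((err as NodeJS.ErrnoException).code === "EAI_AGAIN") eaiAgainCount += 1;
  }
}

setInterval(async () => {
  await probe("internal-service.default.svc.cluster.local"); // placeholder name
  const p99Ms = histogram.percentile(99) / 1e6; // histogram values are in nanoseconds
  console.log(`event-loop p99: ${p99Ms.toFixed(1)} ms, EAI_AGAIN so far: ${eaiAgainCount}`);
  histogram.reset();
}, 5000);
```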
It is strange to me that we are seeing these errors given that our load is not high compared to what other companies run on Kubernetes. Does anyone have pointers on where to look further?
I am using an Aurora DB cluster with 2 readers and pgBouncer to maintain a connection pool.
My application is very read-intensive and fires a lot of SELECT queries.
The problem I am facing is that my 2 read replicas are not being used in parallel.
I can see a trend where all connections move to one replica while the other replica serves 0 connections, and after some time the situation flips: the 2nd replica serves all connections and the 1st serves 0.
I investigated this and found that Aurora DB cluster load balancing is done by time slicing in 1-second intervals.
My guess is that when pgBouncer creates the connection pool, all connections are created within a 1-second window, so they all end up on one read replica.
Is there any way I can correct this?
The DB endpoint is a Route 53 DNS name, and load balancing is basically done via DNS round robin each time you resolve the name. When you use pgBouncer, is it resolving the DNS once and opening all connections to the resolved IP? If so, it is expected that all your connections end up on the same instance. You could fix this conceptually in multiple ways (I'm not too familiar with pgBouncer), but you basically need to either make the library resolve the DNS explicitly for each connection, or explicitly add all the instance endpoints to the configuration. The latter is not recommended if you plan on issuing writes through this connection pool: you don't have any control over which instance stays the writer, so you may inadvertently end up sending your writes to a replica.
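To illustrate the first option conceptually, here is a sketch in Node.js with the pg driver rather than a pgBouncer configuration; the endpoint, credentials and pool size are placeholders. The idea is simply to re-resolve the reader endpoint before opening each connection instead of reusing one cached answer.

```ts
// Conceptual sketch only: re-resolve the Aurora reader endpoint per connection
// so successive connections can land on different replicas.
// Endpoint, credentials and pool size are placeholders.
import { promises as dns } from "dns";
import { Client } from "pg";

const READER_ENDPOINT = "mycluster.cluster-ro-abc123.us-east-1.rds.amazonaws.com"; // placeholder

async function openConnections(count: number): Promise<Client[]> {
  const clients: Client[] = [];
  for (let i = 0; i < count; i++) {
    // Query DNS for each connection; with round robin, successive answers
    // can point at different reader instances.
    const [address] = await dns.resolve4(READER_ENDPOINT);
    const client = new Client({
      host: address,
      port: 5432,
      user: "app",        // placeholder
      password: "secret", // placeholder
      database: "appdb",  // placeholder
    });
    await client.connect();
    clients.push(client);
  }
  return clients;
}
```

With pgBouncer itself, the equivalent would be the second option: list each reader instance endpoint explicitly in the configuration, so the spread across replicas does not depend on DNS at connect time.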
Aurora DB cluster load balancing is done by time slicing in 1-second intervals
I'm not too sure where you read that. Could you share some references?
I'm writing a backend app using Node.js which executes a lot of HTTP requests to external services and S3.
I have reached roughly 800 requests per second on a single Kubernetes pod.
The pod is limited to a single vCPU, and it has reached 100% usage.
I can scale it to tens of pods to handle thousands of requests,
but it seems that this limit has been reached too soon.
I have tested this in my real backend app and then in a demo pod which does nothing but send HTTP requests using axios.
Does it make sense that a single-vCPU Kubernetes pod can only handle 800 req/sec (as a client, not as a server)?
It's hard to give definitive advice on choosing the right compute capacity for your specific needs. However, when you set a 1 vCPU limit on a Pod, that is equivalent to 1 CPU unit on the VMs of most widely used cloud providers.
I would therefore rather add more CPU units to the Pod than spin up more Pods with the same vCPU count via the Horizontal Pod Autoscaler (HPA). If you don't have enough capacity on your node, it is very easy to end up with lots of overloaded Pods, and that does not do the node's compute performance any good.
In your example, there are two key metrics to analyze: latency (the time to send a request and receive the answer) and throughput (requests per second) of HTTP requests. The rule that always applies is: increasing latency decreases the overall throughput of your requests.
You can also read about the Vertical Pod Autoscaler as an option for managing compute resources in a Kubernetes cluster.
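To make the latency/throughput rule concrete: with a fixed number of in-flight requests, throughput is roughly concurrency divided by average latency, so if average latency were about 60 ms, 50 in-flight requests would cap out around 830 req/s. A minimal sketch of measuring this (target URL, concurrency and duration are placeholders):

```ts
// Sketch: measure achieved req/s for a fixed concurrency level.
// Target URL, concurrency and duration are placeholders.
import axios from "axios";
import * as http from "http";

const agent = new http.Agent({ keepAlive: true }); // reuse sockets to cut per-request CPU
const TARGET = "http://example.internal/health";   // placeholder URL
const CONCURRENCY = 50;
const DURATION_MS = 10_000;

async function worker(stats: { done: number }): Promise<void> {
  const end = Date.now() + DURATION_MS;
  while (Date.now() < end) {
    await axios.get(TARGET, { httpAgent: agent });
    stats.done += 1;
  }
}

async function main(): Promise<void> {
  const stats = { done: 0 };
  await Promise.all(Array.from({ length: CONCURRENCY }, () => worker(stats)));
  const reqPerSec = stats.done / (DURATION_MS / 1000);
  console.log(`~${reqPerSec.toFixed(0)} req/s with concurrency ${CONCURRENCY}`);
}

main().catch(console.error);
```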
I am running a load test using JMeter against my Azure web services.
I scaled my service to S2 with 4 instances and run 4 JMeter instances with 500 threads each.
It starts perfectly fine, but after a while calls start failing with timeout errors (HTTP status 500).
I have checked the HTTP request queue on Azure and found that it is very high on the 2nd instance and very low on two of the instances.
Please help me get my load test to succeed.
I assume you are using Azure App Service. If you check your app's settings, you will notice that ARR's Instance Affinity is enabled by default. A brief explanation:
ARR cleverly keeps track of connecting users by giving them a special cookie (known as an affinity cookie), which allows it to know, on subsequent requests, which server instance they were talking to. This way, we can be sure that once a client establishes a session with a specific server instance, it will keep talking to the same server as long as its session is active.
This is an important feature for session-sensitive applications, but if that's not your case, you can safely disable it to improve the load balance between your instances and avoid situations like the one you've described (a header-based way to do this is sketched below).
Disabling ARR’s Instance Affinity in Windows Azure Web Sites
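If it is easier to control this from the application than from the portal, ARR also honors a per-response header that suppresses the affinity cookie. A minimal sketch, assuming an Express-based backend (the question does not say what the services are written in):

```ts
// Sketch: disable ARR session affinity per response via a header.
// Assumes an Express app; adapt to whatever framework the service uses.
import express from "express";

const app = express();

app.use((_req, res, next) => {
  // Tells the ARR front end not to issue the affinity cookie for this response.
  res.setHeader("Arr-Disable-Session-Affinity", "True");
  next();
});

app.get("/health", (_req, res) => {
  res.send("ok");
});

const port = Number(process.env.PORT) || 3000;
app.listen(port);
```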
It might be due to caching of network name resolution at the JVM or OS level, so all your requests hit only one server. If that is the case, add a DNS Cache Manager to your Test Plan and it should resolve your issue.
See The DNS Cache Manager: The Right Way To Test Load Balanced Apps article for more detailed explanation and configuration instructions.
We have a stateless WebApp (with a shared Azure Redis Cache) that we would like to scale automatically via the Azure autoscale service. When I activate scale-out, or even when I configure 3 fixed instances for the WebApp, I get the opposite effect: response times increase exponentially, or I get HTTP 502 errors.
This happens whether I use our configured Traffic Manager URL (which worked fine for months with single instances) or the native URL (.azurewebsites.net). Could this have something to do with Traffic Manager? If so, where can I find information on this combination (I have searched)? And how do I properly combine autoscale with Traffic Manager failover/performance routing? I have tried putting Traffic Manager in both failover and performance mode with no evident effect. I can gladly provide links via private channels.
UPDATE: We have now reproduced the situation "the other way around": on the account where we were getting the frequent 5XX errors, we removed all load-balanced servers (only one server per app now) and the problem disappeared. And on the other account, we started to balance across 3 servers (no Traffic Manager configured) and soon got the frequent 502 and 503 show-stoppers.
Related hypothesis here: https://ask.auth0.com/t/health-checks-response-with-500-http-status/446/8
Possibly the cause? Any takers?
UPDATE
After reverting all WebApps to single instances to rule out any relationship to load balancing, things ran fine for a while. Then the same "502" behavior reappeared across all servers for a period of approx. 15 min on 04.Jan.16, then disappeared again.
UPDATE
Problem reoccurred for a period of 10 min at 12:55 UTC/GMT on 08.Jan.16 and then disappeared again after a few minutes. Checking log files now for more info.
UPDATE
Problem reoccurred for a period of 90 min at roughly 11:00 UTC/GMT on 19.Jan.16, also on the .scm. page. This is the "reference-client" Web App on the account with a Web App named "dummy1015". "502 - Web server received an invalid response while acting as a gateway or proxy server."
I don't think Traffic Manager is the issue here. Since Traffic Manager works at the DNS level, it cannot be the source of the 5XX errors you are seeing. To confirm, I suggest the following:
Check whether the increased response times come from the DNS lookup or from the web request (a small sketch for this check follows below).
Introduce Traffic Manager while keeping your single-instance / non-load-balanced setup, and confirm that the problem does not reappear.
This will help confirm whether the issue relates to Traffic Manager or to some other aspect of the load balancing.
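For the first check, a minimal sketch that times the DNS lookup separately from the full request, so you can see which side the extra latency comes from (the hostname is a placeholder):

```ts
// Sketch: time DNS resolution and the end-to-end HTTPS request separately.
// The hostname is a placeholder for your Traffic Manager or App Service URL.
import { promises as dns } from "dns";
import * as https from "https";

const HOST = "myapp.trafficmanager.net"; // placeholder

async function timeDnsMs(host: string): Promise<number> {
  const start = process.hrtime.bigint();
  await dns.lookup(host);
  return Number(process.hrtime.bigint() - start) / 1e6;
}

function timeRequestMs(host: string): Promise<number> {
  return new Promise((resolve, reject) => {
    const start = process.hrtime.bigint();
    https
      .get({ host, path: "/" }, (res) => {
        res.resume(); // drain the body
        res.on("end", () => resolve(Number(process.hrtime.bigint() - start) / 1e6));
      })
      .on("error", reject);
  });
}

async function main(): Promise<void> {
  console.log(`DNS lookup:   ${(await timeDnsMs(HOST)).toFixed(1)} ms`);
  console.log(`Full request: ${(await timeRequestMs(HOST)).toFixed(1)} ms`);
}

main().catch(console.error);
```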
Regards,
Jonathan Tuliani
Program Manager
Azure Networking - DNS and Traffic Manager