So before I used Kubernetes, the general rule I followed for running multiple Express instances on a VM was one per CPU. That seemed to give the best performance.
For Kubernetes, would it be wise to have a replica per node CPU? Or should I let the HorizontalPodAutoscaler decide? The cluster has a node autoscaler.
Thanks for any advice!
Good question!
You need to consider four things:
1) Run the pod via a Deployment, so you get replication, rolling updates, and so on.
2) Set resources.limits in your container definition. This is mandatory for autoscaling: the HPA monitors the percentage of usage, and if there is no limit there is never a percentage, so the HPA will never reach its threshold.
3) Set resources.requests. This helps the scheduler estimate how much the app needs, so the pod gets assigned to a node with suitable capacity.
4) Set the HPA threshold: the percentage of usage (CPU, memory) at which the HPA triggers a scale out or scale in.
For your situation, you said "one per CPU", so it could look like this:
containers:
- name: express
  image: myapp-node
  # ...
  resources:
    requests:
      memory: "256Mi"
      cpu: "750m"
    limits:
      memory: "512Mi"
      cpu: "1000m"  # <-- 🔴 match what you have in the legacy deployment
You may wonder why I set memory limits/requests without any input from your side. The answer is that those values are placeholders: your task is to monitor your application and adjust all of these values accordingly.
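And for point 4, a minimal HPA sketch could look like the following (the Deployment name express and the 70% CPU target are assumptions; tune them for your app):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: express
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: express
  minReplicas: 2       # keep some capacity for rolling updates
  maxReplicas: 10      # cap scale-out to bound cost
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # scale out when average CPU passes 70%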
I am currently seeing a strange issue where I have a Pod that is constantly being Evicted by Kubernetes.
My Cluster / App Information:
Node size: 7.5GB RAM / 2vCPU
Application Language: nodejs
Use Case: puppeteer website extraction (I have code that loads a website, then extracts an element and repeats this a couple of times per hour)
Running on Azure Kubernetes Service (AKS)
What I tried:
Checked that Puppeteer is closed correctly and that I am removing any leftover Chrome instances. After adding a forced kill, this now seems to happen.
Checked kubectl get events, where it shows the lines:
8m17s Normal NodeHasSufficientMemory node/node-1 Node node-1 status is now: NodeHasSufficientMemory
2m28s Warning EvictionThresholdMet node/node-1 Attempting to reclaim memory
71m Warning FailedScheduling pod/my-deployment 0/4 nodes are available: 1 node(s) had taint {node.kubernetes.io/memory-pressure: }, that the pod didn't tolerate, 3 node(s) didn't match node selector
Checked kubectl top pods, which shows the pod was only utilizing ~30% of the node's memory
Added resource limits to my Kubernetes .yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app-d
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app   # selector/labels added so the manifest is valid; adjust to your labels
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: main
        image: my-image
        imagePullPolicy: Always
        resources:
          limits:
            memory: "2Gi"
Current way of thinking:
A node has X total memory, but only Y of it is actually allocatable because of reserved space. However, when running os.totalmem() in Node.js, I can see that Node still reports the full X memory.
What I am thinking is that Node.js allocates up to X because its garbage collection kicks in based on X instead of Y. With my limit in place, however, I expected the process to see the limit instead of the K8s node's total memory.
Question
Are there any other things I should try to resolve this? Has anyone seen this before?
Your Node.js app is not aware that it runs in a container. It sees only the amount of memory that the Linux kernel reports (which is always the total node memory). You should make your app aware of the cgroup limits, see https://medium.com/the-node-js-collection/node-js-memory-management-in-container-environments-7eb8409a74e8
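As a minimal sketch, one common way is to cap the V8 heap below the container limit via NODE_OPTIONS (the 1536 MB value is an assumption that leaves headroom under your 2Gi limit for non-heap memory):
containers:
- name: main
  image: my-image
  env:
  - name: NODE_OPTIONS
    value: "--max-old-space-size=1536"  # keep the V8 heap well under the 2Gi limit
  resources:
    limits:
      memory: "2Gi"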
With regard to evictions: when you set memory limits, did that solve your eviction problem?
And don't trust kubectl top pods too much. It always shows data with some delay.
My goal:
Implement a cron job that runs once per week. I intend to implement this topology on Knative to save computing resources:
PingSource -> Knative service
The PingSource will emit a dummy event to a Knative service once per week, just to bring up one Knative service pod. That pod will fetch a huge amount of data and then process it.
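As a sketch of that topology (the names, the cron schedule, and the payload are illustrative assumptions):
apiVersion: sources.knative.dev/v1
kind: PingSource
metadata:
  name: weekly-trigger
spec:
  schedule: "0 0 * * 0"          # once per week, Sunday at midnight
  data: '{"trigger": "weekly"}'  # dummy payload just to wake the service
  sink:
    ref:
      apiVersion: serving.knative.dev/v1
      kind: Service
      name: my-data-processor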
My concern:
If I set enable-scale-to-zero to true, the Knative pod autoscaler will probably shut down the Knative service pod even when the pod has not finished its work.
So far, I have explored:
The scale-to-zero-grace-period, which can be configured to tell the autoscaler how long to wait after the last traffic ends before shutting down the pod (a sketch of this config follows below). But I don't find this approach precise enough. I would prefer something similar to a readinessProbe or livenessProbe: the autoscaler should probe the pod to learn whether it is still processing something before sending the kill signal.
In addition, according to Knative's docs, there are two types of event sink: callable and addressable. Both return a response or acknowledgement. Would the Knative autoscaler consider the pod to be handling the request until the pod returns the response/acknowledgement? If so, as long as the pod has not responded, it won't be removed by the autoscaler.
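For reference, a sketch of where scale-to-zero-grace-period is set, in the config-autoscaler ConfigMap (the 10m value is an illustrative assumption):
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-autoscaler
  namespace: knative-serving
data:
  enable-scale-to-zero: "true"
  scale-to-zero-grace-period: "10m"  # wait 10 minutes after the last request before scaling to zero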
The Knative autoscaler relies on the pod working strictly in a request/response fashion. As long as the "huge amount of data" is processed as part of an HTTP request (or a WebSocket session, gRPC session, etc.), the pod will not even be considered for deletion.
What will not work is receiving the request, returning immediately, and then munging the data in the background. The autoscaler will think there is no activity at all and thus shut the pod down. There is a sandbox project that tries to implement such asynchronous semantics, though.
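A minimal Node.js sketch of the pattern the autoscaler expects — respond only after the work is done, so the request stays in flight the whole time (the Express setup and processAllData are illustrative assumptions):
const express = require('express');
const app = express();

app.post('/', async (req, res) => {
  // Do all the heavy lifting inside the request handler. While this
  // promise is pending the request is still in flight, so the Knative
  // autoscaler counts the pod as busy and will not scale it to zero.
  await processAllData(); // hypothetical long-running job

  res.status(200).send('done'); // acknowledge only once the work is finished
});

app.listen(process.env.PORT || 8080);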
(I'm learning Cloud Run. I acknowledge this is not development or code related, but I'm hoping some GCP engineer can clarify this.)
I have a Python application running with gunicorn + Flask... just a PoC for now, hence the minimal configuration.
gcloud run deploy has the following flags:
--max-instances 1
--concurrency 5
--memory 128Mi
--platform managed
gunicorn_cfg.py has the following configuration:
workers=1
worker_class="gthread"
threads=3
I'd like to know:
1) max-instances :: if I were to adjust this, does that mean a new physical server machine is provisioned whenever needed? Or does the service achieve this by pulling the container image and simply starting a new container instance (docker run ...) on the same physical machine, effectively sharing it with other container instances?
2) concurrency :: does one running container instance receive multiple concurrent requests (e.g., 5 concurrent requests processed by 3 running container instances)? Or does each concurrent request trigger a new container instance (docker run ...)?
3) lastly, can I effectively reach concurrency > 5 by adjusting the gunicorn thread settings? For example, 5x3=15 in this case: 15 concurrent requests served by 3 running container instances? If so, are there pros/cons to adjusting threads vs adjusting Cloud Run concurrency?
additional info:
- It's an I/O-intensive application (not CPU-intensive): it simply grabs the HTTP request and publishes to Pub/Sub.
thanks a lot
First of all, it's not appropriate on Stack Overflow to ask "cocktail questions" where you ask 5 things at a time. Please limit yourself to one question at a time in the future.
1) You're not supposed to worry about where containers run (physical machines, VMs, ...). --max-instances limits the number of container instances that your app is allowed to scale to. This is to prevent ending up with a huge bill if someone maliciously sends too many requests to your app.
2) This is documented at https://cloud.google.com/run/docs/about-concurrency. If you specify --concurrency=10, requests are routed so that your container has at most 10 in-flight requests at a time. So make sure your app can handle 10 requests at once.
3) Yes, read the gunicorn documentation. Test locally whether your settings let gunicorn handle 5 requests at the same time. Cloud Run's --concurrency setting just ensures you don't get more than 5 requests to one container instance at any moment.
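As a sketch, one way to line the two up (assuming the gthread worker class; values are illustrative):
# gunicorn_cfg.py — match gunicorn's capacity to Cloud Run's --concurrency
workers = 1            # one worker per container; Cloud Run scales by adding instances
worker_class = "gthread"
threads = 5            # at least as many threads as --concurrency, so no request queues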
I also recommend reading the official docs more thoroughly before asking, and perhaps also the cloud-run-faq, which pretty much answers all of these.
I'm writing a backend app using Node.js which executes a lot of HTTP requests to external services and S3.
I have reached roughly 800 requests per second on a single Kubernetes pod.
The pod is limited to a single vCPU, and it reaches 100% usage.
I can scale it out to tens of pods to handle thousands of requests, but it seems that this limit is reached too soon.
I tested this with my real backend app and then with a demo pod that does nothing but send HTTP requests using axios.
Does it make sense that a single-vCPU Kubernetes pod can only handle 800 req/sec (as a client, not as a server)?
It's hard to give advice on the best approach to sizing compute resources for your specific needs. However, note that a 1 vCPU limit on a Pod is equivalent to 1 CPU unit on most widely used cloud providers' VMs.
So I would bet on adding more CPU units to your Pod rather than spinning up more Pods with the same vCPU count via the HPA (Horizontal Pod Autoscaler) feature. If you don't have enough capacity on your nodes, it's very easy to end up with lots of overloaded Pods, which will not have a positive influence on the nodes themselves.
In your example, there are two key metrics to analyze: latency (the time to send a request and receive the answer) and throughput (requests per second) of HTTP requests. The rule of thumb: increasing latency decreases the overall throughput of your requests.
You can also read about the Vertical Pod Autoscaler as an option for managing compute resources in a Kubernetes cluster.
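For instance, a minimal VPA sketch (the Deployment name backend is an assumption, and the VPA controller must be installed in the cluster):
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: backend-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backend
  updatePolicy:
    updateMode: "Auto"   # let VPA resize CPU/memory requests automatically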
In brief, I am having trouble supporting more than 5,000 read requests per minute from a data API built on PostgreSQL, Node.js, and node-postgres. The bottleneck appears to be between the API and the DB. Here are the implementation details.
I'm using an AWS PostgreSQL RDS database instance (m4.4xlarge - 64 GB mem, 16 vCPUs, 350 GB SSD, no provisioned IOPS) for a Node.js powered data API. By default the RDS's max_connections=5000. The Node API is load-balanced across two clusters with 4 processes each (2 EC2s with 4 vCPUs running the API with PM2 in cluster mode). I use node-postgres to bind the API to the PostgreSQL RDS, and am attempting to use its connection pooling feature. Below is a sample of my connection pool code:
var { Pool } = require('pg'); // node-postgres

var pool = new Pool({
  user: settings.database.username,
  password: settings.database.password,
  host: settings.database.readServer,
  database: settings.database.database,
  max: 25,
  idleTimeoutMillis: 1000
});

/* Example of pool usage */
pool.query('SELECT my_column FROM my_table', function(err, result){
  /* Callback code here */
});
Using this implementation and testing with a load tester, I can support about 5,000 requests over the course of one minute, with an average response time of about 190 ms (which is what I expect). As soon as I fire off more than 5,000 requests per minute, my response time increases to over 1,200 ms in the best case, and in the worst case the API begins to time out frequently. Monitoring indicates that for the EC2s running the Node.js API, CPU utilization remains below 10%. Thus my focus is on the DB and the API's binding to the DB.
I have attempted to increase (and decrease, for that matter) the node-postgres "max" connections setting, but there was no change in the API response/timeout behavior. I've also tried provisioned IOPS on the RDS, but no improvement. Interestingly, I also scaled the RDS up to m4.10xlarge (160 GB mem, 40 vCPUs), and while the RDS CPU utilization dropped greatly, the overall performance of the API worsened considerably (it couldn't even support the 5,000 requests per minute that the smaller RDS could).
I'm in unfamiliar territory in many respects and am unsure how best to determine which of these moving parts is bottlenecking API performance above 5,000 requests per minute. As noted, I have attempted a variety of adjustments based on the PostgreSQL configuration documentation and node-postgres documentation, but to no avail.
If anyone has advice on how to diagnose or optimize I would greatly appreciate it.
UPDATE
After scaling up to m4.10xlarge, I performed a series of load tests, varying the number of requests/min and the max number of connections in each pool. Here are some screen captures of monitoring metrics:
In order to support more than 5k requests while maintaining the same response rate, you'll need better hardware...
The simple math states that:
5,000 requests × 190 ms avg = 950,000 ms of DB work per minute; spread across 16 cores, that is ~59,000 ms per core, out of the ~60,000 ms each core has available per minute,
which basically means your system was running at close to full capacity.
(I'm guessing you had some spare CPU, as some time was lost on networking.)
Now, the really interesting part of your question is the scale-up attempt: m4.10xlarge (160 GB mem, 40 vCPUs).
The drop in CPU utilization indicates that the scale-up freed up DB time resources, so you need to push more requests!
Two suggestions:
Try increasing the connection pool to max: 70 and watch the network traffic (depending on the amount of data, you might be saturating the network).
Also, are your requests to the DB async on the application side? Make sure your app can actually push more requests (see the sketch below).
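A minimal sketch of what async, concurrent queries might look like with node-postgres promises (pool as defined above; the query and ids are illustrative):
// Fire queries concurrently instead of one at a time, so the pool's
// connections are actually kept busy.
async function fetchBatch(ids) {
  const results = await Promise.all(
    ids.map(id =>
      pool.query('SELECT my_column FROM my_table WHERE id = $1', [id])
    )
  );
  return results.map(r => r.rows);
}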
The best way is to use a separate Pool for each API call, based on the call's priority:
const { Pool } = require('pg');

// Connection settings omitted for brevity (same as the pool above).
const highPriority = new Pool({max: 20}); // for high-priority API calls
const lowPriority = new Pool({max: 5});   // for low-priority API calls
Then you just use the right pool for each API call, for optimal service/connection availability.
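For example (the endpoints and queries are hypothetical):
// High-priority endpoint gets the bigger pool.
app.get('/orders', (req, res) => {
  highPriority.query('SELECT my_column FROM my_table', (err, result) => {
    if (err) return res.status(500).end();
    res.json(result.rows);
  });
});

// Low-priority endpoint competes only for its own small pool.
app.get('/reports', (req, res) => {
  lowPriority.query('SELECT my_column FROM my_table', (err, result) => {
    if (err) return res.status(500).end();
    res.json(result.rows);
  });
});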
Since you are interested in read performance, you can set up replication between two (or more) PostgreSQL instances, and then use pgpool-II to load balance between them.
Scaling horizontally means you won't start hitting the max instance sizes at AWS if you decide next week that you need to handle 10,000 concurrent reads.
You also start to get some HA in your architecture.
--
Many times people will use pgbouncer as a connection pooler even if they already have one built into their application code. pgbouncer works really well and is typically easier to configure and manage than pgpool, but it doesn't do load balancing. I'm not sure it would help you very much in this scenario, though.
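For illustration, a minimal pgbouncer.ini sketch (the host and database names are assumptions):
[databases]
mydb = host=my-rds-endpoint.rds.amazonaws.com port=5432 dbname=mydb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
pool_mode = transaction   ; reuse server connections between transactions
default_pool_size = 25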