Cloud Run Qs :: max-instances + concurrency + threads (gunicorn thread) - python-3.x

(I'm learning Cloud Run and I acknowledge this is not strictly development or code related, but I'm hoping some GCP engineer can clarify this.)
I have a Python application running with gunicorn + Flask... it's just a PoC for now, which is why the configuration is minimal.
My cloud run deploy command has the following flags:
--max-instances 1
--concurrency 5
--memory 128Mi
--platform managed
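Put together, the deploy command looks roughly like this (a sketch; the service name, image path, and region are placeholders, not my real values):
gcloud run deploy my-poc-service \
  --image gcr.io/MY_PROJECT/my-poc-image \
  --platform managed \
  --region us-central1 \
  --max-instances 1 \
  --concurrency 5 \
  --memory 128Mi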
gunicorn_cfg.py has the following configuration:
workers=1
worker_class="gthread"
threads=3
I'd like to know:
1) max-instances :: if I were to adjust this, does that mean a new physical server machine is provisioned whenever needed? Or does the service achieve this by pulling the container image and simply starting a new container instance (docker run ...) on the same physical machine, effectively sharing that machine with other container instances?
2) concurrency :: does one running container instance receive multiple concurrent requests (e.g. 5 concurrent requests processed by 3 running container instances)? Or does each concurrent request trigger a new container instance (docker run ...)?
3) lastly, can I effectively reach concurrency > 5 by adjusting the gunicorn thread settings? For example 5x3=15 in this case, i.e. 15 concurrent requests being served by 3 running container instances? If that's true, are there any pros/cons to adjusting threads vs adjusting Cloud Run concurrency?
additional info:
- It's an IO-intensive application (not CPU intensive): it simply grabs the HTTP request and publishes it to Pub/Sub (a minimal sketch of such a handler is shown below).
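For reference, the handler is essentially this shape (a minimal sketch, not the real code; it assumes the google-cloud-pubsub client library, and the project/topic names are placeholders):
# app.py - sketch of an IO-bound handler: accept a request, publish it to Pub/Sub
from flask import Flask, request
from google.cloud import pubsub_v1

app = Flask(__name__)
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "my-topic")  # placeholder project/topic

@app.route("/publish", methods=["POST"])
def publish():
    # The thread spends most of its time waiting on the Pub/Sub RPC, not on CPU
    future = publisher.publish(topic_path, request.get_data())
    future.result()  # block until the message is accepted
    return "ok", 200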
thanks a lot

First of all, it's not appropriate on Stack Overflow to ask "cocktail questions" where you ask 5 things at a time. Please limit yourself to 1 question at a time in the future.
You're not supposed to worry about where containers run (physical machines, VMs, ...). --max-instances limits the number of container instances to which your app is allowed to scale. This is to prevent you from ending up with a huge bill if someone maliciously sends too many requests to your app.
This is documented at https://cloud.google.com/run/docs/about-concurrency. If you specify --concurrency=10, your container can be routed to have at most 10 in-flight requests at a time. So make sure your app can handle 10 requests at a time.
Yes, read the gunicorn documentation. Test locally whether your settings let gunicorn handle 5 requests at the same time... Cloud Run's --concurrency setting ensures you don't get more than 5 requests routed to 1 container instance at any moment.
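On question 3: Cloud Run will never route more than --concurrency requests to an instance, so extra gunicorn threads on their own don't raise effective concurrency; they only give the instance headroom to actually serve what Cloud Run sends it. A sketch of sizing the two to match, assuming the single-worker gthread setup from the question (the timeout line is an assumption, not part of the original config):
# gunicorn_cfg.py - sketch: make in-process capacity match --concurrency 5
workers = 1              # one process per container instance
worker_class = "gthread"
threads = 5              # workers x threads = 5, enough to serve --concurrency 5
timeout = 0              # assumption: rely on Cloud Run's own request timeout instead of gunicorn's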
I also recommend you read the official docs more thoroughly before asking, and perhaps also the cloud-run-faq, which pretty much answers all of these.

Related

Azure Functions concurrency: maxConcurrentRequests - is it truly parallel, giving simultaneous execution of all requests happening at the same time?

Here in this thread https://stackoverflow.com/a/66163971/6514559 it is explained that:
If Azure decides that your App needs to scale and creates a new host, and say there are two hosts, then the values of these params (maxConcurrentRequests, FUNCTIONS_WORKER_PROCESS_COUNT) are applied per host, not across hosts.
If your App has multiple Functions, then maxConcurrentRequests applies to all/across Functions within this host, not per Function.
The questions are:
Is it possible to have more than one function app on a single host? (Is this what is controlled by FUNCTIONS_WORKER_PROCESS_COUNT?)
maxConcurrentRequests = 100 - does this really mean that all 100 requests will be processed in parallel (simultaneously) by a single host (Consumption plan, 1 CPU, 1.5 GB host)? This thread here suspects everything is executed in series?!
Since each instance of the Functions host in the Consumption plan is limited to 1.5 GB of memory and one CPU (Reference), how can it run parallel loads with one CPU? On a different thought, this does say the ACU per instance is 100 for the Consumption plan.
See this, this and this. And OP already read it, but this also for completeness.
Is it possible to have more than one function app on a single host
Documentation is very confusing. AFAIK:
On consumption plan, no.
On premium/app-service plan, there is a hint that might mean the relation is one host to many apps, but IMO it's debatable.
(Is this what is controlled by FUNCTIONS_WORKER_PROCESS_COUNT?)
NO.
Need to understand the terms:
Function App: One Function App. Top level Azure Resource. A logical collection of Functions.
Function: One Function with in/out-trigger/binding(s). One Function App contains one or more Functions.
Function Host: Virtual/physical host where Function App runs as Linux/Windows process.
Worker Process: One process (one pid) running on a Function Host.
One Worker Process hosts all Functions of one Function App.
One Host runs FUNCTIONS_WORKER_PROCESS_COUNT (default 1) Worker Processes, sharing all resources (RAM, CPU, ...); a sketch of where these settings are configured is shown below.
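For completeness, a sketch of where these knobs live (illustrative values; assuming Functions runtime v2+): maxConcurrentRequests goes in host.json under the http extension, while FUNCTIONS_WORKER_PROCESS_COUNT is an application setting on the Function App.
host.json (per Function App):
{
  "version": "2.0",
  "extensions": {
    "http": {
      "maxConcurrentRequests": 100
    }
  }
}
Application setting, e.g. via the Azure CLI (app and resource-group names are placeholders):
az functionapp config appsettings set --name my-func-app --resource-group my-rg --settings FUNCTIONS_WORKER_PROCESS_COUNT=2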
maxConcurrentRequests = 100 does this really means that all 100 requests will be processed in parallel (simultaneously) by a single host (Consumption plan , 1 CPU,1.5GB Host ) .
Discounting cold start problems, execution would be in parallel within the limits of the selected plan.
This thread here suspects everything is executed in series?!
I'm sure there is an explanation. There is unambiguous documentation that says requests do get executed in parallel, within limits.

Best way to approach long-running tasks (but not CPU or memory intensive) in GCP

We built a web application where we used Firebase Functions for lightweight work such as login, profile updates, etc., and we deployed 2 functions to App Engine with Node.js.
Function 1: Downloads an audio/video file from Firebase Storage, converts it with ffmpeg, and uploads the converted version back to Storage.
But App Engine terminates the process with a signal in the middle of the download (after ~40 seconds) if the file is large (>500 MB).
Function 2: Calls a Google API (ASR), waits for the response (progress %), and writes this response to Firestore until it is completed (100%).
This process may take between 1 min and 20 min depending on the file length. But here we get two different problems.
Either App Engine creates a new instance in the middle of the API call and kills the current instance (even though we set the instance count to 1 and there are no concurrent requests).
I don't understand this behavior, since it should not require intensive CPU or memory usage. App Engine gives the following info in the logs:
This request caused a new process to be started for your application,
and thus caused your application code to be loaded for the first time.
This request may thus take longer and use more CPU than a typical
request for your application
Or App Engine terminates due to idle_timeout, even though it is waiting for an async API response and writing it to the db.
It looks like when there are no incoming requests, App Engine considers itself idle and terminates after a while (~10 minutes).
We are new to GCP and App Engine, so maybe we are using the wrong product (e.g. Compute Engine?) or doing the wrong implementation. We also saw Pub/Sub, Cloud Tasks, etc., which look like a solution for our case.
So I wonder, what would be the most elegant way to approach the problem and implement a solution?
Any comment, feedback is appreciated.
Regards
A. Faruk Acar
App Engine app.yaml Configuration
runtime: nodejs10
manual_scaling:
  instances: 1
App Engine has a maximum timeout per request
10 minutes for App Engine Standard
60 minutes for App Engine Flex
in both cases the default values are less.
So for the process you describe, it is not the optimal solution.
Cloud Tasks has similar limitations as App Engine so you might get to similar problems.
Compute Engine can work for you, as it gives you virtual machines where you control the configuration. To keep it cost-effective, see what the smallest machine type is that can run your app.
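For example, a small VM could be created along these lines (a sketch; the instance name, machine type, zone and image are placeholders to adjust):
gcloud compute instances create my-worker \
  --machine-type=e2-small \
  --zone=us-central1-a \
  --image-family=debian-11 \
  --image-project=debian-cloud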

Load test on Azure

I am running a load test using JMeter on my Azure web services.
I scale my services to the S2 tier with 4 instances and run 4 JMeter instances with 500 threads each.
It starts perfectly fine, but after a while calls start failing with timeout errors (HTTP status 500).
I have checked the HTTP request queue on Azure and found that it is very high on the 2nd instance and very low on two instances.
Please help me get my load test to succeed.
I assume you are using Azure App Service. If you check the settings of your app, you will notice ARR's Instance Affinity is enabled by default. A brief explanation:
ARR cleverly keeps track of connecting users by giving them a special cookie (known as an affinity cookie), which allows it to know, upon subsequent requests, to which server instance they were talking to. This way, we can be sure that once a client establishes a session with a specific server instance, it will keep talking to the same server as long as his session is active.
This is an important feature for session-sensitive applications, but if it's not your case then you can safely disable it to improve the load balance between your instances and avoid situations like the one you've described.
Disabling ARR’s Instance Affinity in Windows Azure Web Sites
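If you prefer the CLI over the portal, the same setting can be turned off with something like this (a sketch; the app and resource-group names are placeholders):
az webapp update --name my-webapp --resource-group my-rg --client-affinity-enabled false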
It might be due to caching of network name resolution at the JVM or OS level, so all your requests are hitting only one server. If that is the case, add a DNS Cache Manager to your Test Plan and it should resolve your issue.
See The DNS Cache Manager: The Right Way To Test Load Balanced Apps article for more detailed explanation and configuration instructions.

Run Node JS on a multi-core cluster cloud

Is there a service, framework, or any way that would allow me to run Node.js for heavy computations, letting me choose the number of cores?
I'll be more specific: let's say I want to run some expensive computation for each of my users and I have 20000 users.
So I want to run the expensive computation for each user on a separate thread/core/computer, so I can finish the computation for all users faster.
But I don't want to deal with low-level server configuration; all I'm looking for is something similar to AWS Lambda but for high-performance computing, i.e. letting me scale as I please (maybe I want 1000 cores).
I did simulate this with AWS Lambda by having a "master" Lambda that receives the data for all 20000 users and then calls a "computation" Lambda for each user. The problem is, with AWS Lambda I can't make 20000 requests and wait for their callbacks at the same time (I get a request limit exceeded error).
With some setup I could use Amazon HPC, Google Compute Engine or Azure, but they only go up to 64 cores, so if I need more than that, I'd still have to set up all the machines I need separately and orchestrate the communication between them with something like Open MPI, handling the different low-level setups for master and compute instances (accessing via ssh, etc.).
So is there any service I can just paste my Node JS code, maybe choose the number of cores and run (not having to care about OS, or how many computers there are in my cluster)?
I'm looking for something that can take that code:
var users = [...];
function expensiveCalculation(user) {
  // ...
  return ...;
}
users.forEach(function(user) {
  Thread.create(function() {
    save(user.id, expensiveCalculation(user));
  });
});
And run each thread on a separate core so they can run simultaneously (therefore finishing faster).
I think that your problem is that you feel the need to process 20000 inputs at once on the same machine. Have you looked into SQS from Amazon? Maybe you push those 20000 inputs into SQS and then have a cluster of servers pull from that queue and process each one individually.
With this approach you could add as many servers, processes or add as many AWS Lambda invokes as you want. You could even use a combination of the 3 to see what's cheaper or faster. Adding resources will only reduce the amount of time it would take to complete the computations. Then you wouldn't have to wait for 20000 requests or anything to complete. The process could tell you when it completes the computation by sending some notification after it completes.
So basically, you could have a simple application that just grabbed 10 of these inputs at a time and ran your computation on them. After it finishes you could then have this process delete them from SQS and send a notification somewhere (Maybe SNS?) to notify the user or some other system that they are done. Then it would repeat the process.
After that you could scale the process horizontally and you wouldn't need a super computer in order to process this. So you could either get a cluster of EC2 instances that ran several of these applications a piece or have a Lambda function invoked periodically in order to pull items out of SQS and process them.
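To make the queue idea concrete, a worker along these lines would long-poll SQS, process a batch, and delete what it handled. This is a sketch in Python with boto3 purely for illustration (the equivalent receiveMessage/deleteMessage calls exist in the AWS SDK for JavaScript); the queue URL and the computation body are placeholders:
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/users-to-process"  # placeholder

def expensive_calculation(payload):
    ...  # the per-user computation goes here

while True:
    # Long-poll for up to 10 messages at a time
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        result = expensive_calculation(msg["Body"])
        # save the result / notify (e.g. via SNS) here, then remove the message from the queue
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])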
EDIT:
To get started using an EC2 instance I would look at the docs here. To start with I would pick the smallest, cheapest instance (t2.micro I think), and leave everything at its default. There's no need to open any port other than the one for SSH.
Once it's set up and you log in, the first thing you need to do is run aws configure to set up your profile so that you can access AWS resources from the instance. After that, install Node and get your application on there using git or something. Once that's done, go to the EC2 console, and in the Actions menu there will be an option to create an image from the instance.
Once you create an image, then you can go to Auto Scaling groups and create a launch configuration using that AMI. Then it'll let you specify how many instances you want to run.
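If you'd rather script those two steps than click through the console, they look roughly like this with the AWS CLI (a sketch; the names and AMI ID are placeholders for the image you created):
aws autoscaling create-launch-configuration \
  --launch-configuration-name worker-lc \
  --image-id ami-0123456789abcdef0 \
  --instance-type t2.micro
aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name worker-asg \
  --launch-configuration-name worker-lc \
  --min-size 1 --max-size 4 \
  --availability-zones us-east-1a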
I feel like this could also be done more easily using their container service, but honestly I don't know how to use it yet.

How to fail over node.js timer on amazon load balancer?

I have set up 2 instances under an AWS load balancer. I have deployed Node.js web services + MongoDB on both instances. The load balancer works fine with the web services.
But the problem is that I have one timer service (a Node.js service only). The behavior of this timer is to update my MongoDB based on some calculation.
My problem is that this timer service (timer.js) must run on only one AWS instance (out of the 2) at a time, and I expect that if one AWS instance goes down, the timer service on the other instance will take over.
I know ELB does not provide this kind of facility. Can anyone please help me get this done?
Condition: only one timer service must be running at a time behind the Amazon load balancer.
Thanks.
You would have to implement this yourself using a locking algorithm on top of a shared data store that supports atomic operations.
Alternatively, consider starting a "timer" server in an Auto Scaling group of Min: 1, Max: 1 so Amazon keeps it running. This instance can be a t2.micro, which is very cheap. It can either run the job itself, or just make an HTTP request to your load balancer to run the job at the desired interval. If you do that, only one of your servers will run each job.
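A sketch of the locking idea, since MongoDB is already in the stack: each instance tries to atomically take a short lease before running the timer job, and only the winner runs it. Shown here with pymongo purely for illustration (the Node.js driver has the same findOneAndUpdate operation); the collection name, lease TTL and instance id are placeholders:
import time
from pymongo import MongoClient

locks = MongoClient()["mydb"]["locks"]  # placeholder db/collection names

# One-time setup: make sure the lock document exists
locks.update_one({"_id": "timer-job"}, {"$setOnInsert": {"expires_at": 0}}, upsert=True)

def try_acquire(instance_id, ttl_seconds=60):
    now = time.time()
    # Atomic: only one instance can move expires_at forward while the old lease is expired
    doc = locks.find_one_and_update(
        {"_id": "timer-job", "expires_at": {"$lt": now}},
        {"$set": {"owner": instance_id, "expires_at": now + ttl_seconds}},
    )
    return doc is not None  # True only for the instance that won the lease

# On each timer tick, run the MongoDB update only if try_acquire(...) returns True.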
Wouldn't it make more sense to handle this like any other "service" that needs to keep running?
upstart service
running node.js server using upstart causes 'terminated with status 127' on 'ubuntu 10.04'
This guy had a bad path in his file but his upstart script looks okay
monit
Node.js (sudo) and monit
