Multi-threading for heavy transcoding tasks in NodeJS on flex GAE

I am transcoding a lot of MP4 files to FLAC/WAV using fluent-ffmpeg on flex GAE.
I am thinking about how GAE will handle 10-20 concurrent transcoding operations.
Transcoding is computationally expensive, and GAE might spawn a new instance for each request if it is not configured carefully.
app.yaml
runtime: nodejs
env: flex
resources:
  cpu: 2
  memory_gb: 3.75
automatic_scaling:
  min_num_instances: 1
  max_num_instances: 30
  cpu_utilization:
    target_utilization: 0.8
Is there a way to keep the load on a single GAE instance, e.g. with Node worker threads?
What's the conventional resources section in app.yaml for such tasks?

If you want to avoid the creation of new instances (horizontal scaling) for each request, you should work on your vertical scaling: choose an instance class powerful enough to process your requests and then set the max_num_instances parameter accordingly. Of course, structuring and designing your code in an "economical" way is the most important part.
If you want to be sure that only one GAE instance is running, switch to manual scaling and set instances: 1, or stay on automatic_scaling and set max_num_instances: 1 (I have not tried the latter, but I do not see why it wouldn't work).
Keep in mind to change the value of the max_concurrent_requests parameter to whatever value you want. By default it is set to 10, so you would not be able to process more than 10 concurrent requests at a time.
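One thing worth noting: fluent-ffmpeg spawns the ffmpeg binary as a child process, so the heavy CPU work already happens outside the Node event loop, and worker threads are not strictly required. What you mainly need on a single instance is to cap how many transcodes run at once. Below is a minimal sketch of that idea; the queue, the MAX_CONCURRENT value and the FLAC-only output are illustrative assumptions, not taken from your setup:

const ffmpeg = require('fluent-ffmpeg');

const MAX_CONCURRENT = 2; // roughly match the cpu: 2 in app.yaml
let running = 0;
const queue = [];

// Queue a transcode and resolve when ffmpeg finishes.
function transcode(inputPath, outputPath) {
  return new Promise((resolve, reject) => {
    queue.push({ inputPath, outputPath, resolve, reject });
    drain();
  });
}

// Start queued jobs as long as fewer than MAX_CONCURRENT ffmpeg processes run.
function drain() {
  while (running < MAX_CONCURRENT && queue.length > 0) {
    const job = queue.shift();
    running += 1;
    ffmpeg(job.inputPath)
      .noVideo()            // keep only the audio stream
      .toFormat('flac')
      .on('end', () => { running -= 1; job.resolve(job.outputPath); drain(); })
      .on('error', (err) => { running -= 1; job.reject(err); drain(); })
      .save(job.outputPath);
  }
}

An HTTP handler can then await transcode(...), so that 10-20 concurrent requests wait in the in-process queue instead of pushing CPU utilization past the scaling threshold.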

Related

How to convert a multiprocess Flask/gunicorn app to a single multithreaded process

I would like to cache a large amount of data in a Flask application. Currently it runs on K8S pods with the following gunicorn.ini:
bind = "0.0.0.0:5000"
workers = 10
timeout = 900
preload_app = True
To avoid caching the same data in those 10 workers, I would like to know if Python supports a way to multi-thread instead of multi-process. This would be very easy in Java, but I am not sure if it is possible in Python. I know that you can share a cache between Python instances using the file system or other methods, but it would be a lot simpler if it were all shared in the same process space.
Edited:
There are a couple of posts suggesting that threads are supported in Python: this comment by Filipe Correia, and this answer in the same question.
Based on the above comment, the Gunicorn design document talks about workers and threads:
Since Gunicorn 19, a threads option can be used to process requests in multiple threads. Using threads assumes use of the gthread worker.
Based on how Java works, to share some data among threads I would need one worker and multiple threads. Based on this other link,
I know it is possible. So I assume I can change my gunicorn configuration as follows:
bind = "0.0.0.0:5000"
workers = 1
threads = 10
timeout = 900
preload_app = True
This should give me 1 worker and 10 threads, which should be able to process the same number of requests as the current configuration. However, the question is: would the cache still be instantiated once and shared among all the threads? How or where should I instantiate the cache to make sure it is shared among all the threads?
"would like to ... multi-thread instead of multi-process."

I'm not sure you really want that. Python is rather different from Java.

"workers = 10"

One way to read that is "ten cores", sure. But another way is "wow, we get ten GILs!" The global interpreter lock must be held before the interpreter interprets a new bytecode instruction. Ten interpreters offer significant parallelism, executing ten instructions simultaneously.

Now, there are workloads dominated by async I/O, or where the interpreter calls into a C extension to do the bulk of the work. If a C thread can keep running, doing useful work in the background, and the interpreter gathers the result later, terrific. But that's not most workloads.

tl;dr: You probably want ten GILs, rather than just one.

"To avoid caching the same data in those 10 workers"

Right! That makes perfect sense. Consider pushing the cache into a storage layer, or a daemon like Redis. Or access a memory-resident cache, in the context of your own process, via mmap or shmat.

When running Flask under Gunicorn, you are certainly free to set threads greater than 1, though it's likely not what you want. YMMV. Measure and see.

Why does the response time curve of a NodeJS API become sinusoidal under load?

I am currently performing an API Load Test on my NodeJS API using JMeter and am completely new to the field. The API is deployed on an IBM Virtual Server with 4 vCPUs and 8GB of RAM.
One of my load tests stress tests the API with a 2500 thread (users) configuration, a ramp-up period of 2700 s (45 min) and an infinite loop. The goal is not to reach 2500 threads but rather to see at what point my API throws its first error.
I am only testing one endpoint of the API, which performs a bubble sort to simulate a CPU-intensive task. Using Matplotlib I plotted the results of the experiment: the response time in ms over the number of active threads.
I am unsure why the response time curve becomes sinusoidal once it crosses roughly 1100 threads. I expected the response time curve to keep rising in the same manner it does in the beginning (0-1100 threads). Is there an explanation for the sinusoidal behaviour of the curve towards the end?
Thank you!
Graph (not shown): red = errors, blue = response time.
There could be two possible reasons for this:
Your application cannot handle such a big load and performs frequent garbage collection in order to free up resources, or tasks are queuing up because the application cannot process them as fast as they arrive. You can try using, for example, the JMeter PerfMon Plugin to ensure that the system under test doesn't lack CPU or RAM.
JMeter by default comes with a relatively low JVM heap size and very little GC tuning (as described in the Concurrent, High Throughput Performance Testing with JMeter article, where the author has very similar symptoms), so it might be the case that JMeter cannot send requests fast enough. Make sure to follow JMeter Best Practices and consider distributed testing if needed.
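If you want to check the first hypothesis from inside the Node process itself, one option is to log the event loop delay while the test runs; a steadily growing delay suggests requests are queuing behind the CPU-bound bubble sort. A minimal sketch, assuming Node 12+ (the 20 ms sampling resolution and the 5 s logging interval are arbitrary choices):

const { monitorEventLoopDelay } = require('perf_hooks');

// Sample event loop delay; the histogram reports values in nanoseconds.
const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();

// Log the 99th-percentile delay every 5 seconds while the load test runs.
setInterval(() => {
  const p99Ms = histogram.percentile(99) / 1e6;
  console.log(`event loop delay p99: ${p99Ms.toFixed(1)} ms`);
  histogram.reset();
}, 5000);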

Google App Engine - low performance with concurrent requests

I have made an experiment with Google App Engine.
I've added execution time measurements to my Python 3 web service. They measure the real time that passed during code execution, not CPU time (using time.time()).
One of the measurements covers the whole Python function execution: the measurement starts at the first line of the function and ends right before returning the result.
For a simple test input, timing is as expected: it took around 0.7 seconds to perform all function operations, as can be seen in the logs.
The measured times are similar regardless of whether data is requested sequentially by one thread or in parallel by 16 threads. I am using JMeter to simulate the load.
The more interesting part is the overall request time.
When queried sequentially by one thread, the response time is similar to the time taken by code execution.
But for some reason, when the service is queried in parallel by 16 threads, the overall response time grows to 11 seconds.
I am surprised by this behavior.
I checked the resources used by the service. At the peak moment the CPU was at 40% and RAM usage was under 600 MB.
Here is the app.yaml configuration for this service:
runtime: python
env: flex
service: my_service_name
entrypoint: gunicorn -b :$PORT main:app --timeout 240 --limit-request-line 0
runtime_config:
  python_version: 3
automatic_scaling:
  min_num_instances: 1
  max_num_instances: 10
  cpu_utilization:
    target_utilization: 0.8
resources:
  cpu: 1
  memory_gb: 2
When I start the web service locally on my laptop, I get the same average response time regardless of the number of concurrent threads.
Any tips or hints on how to configure this to work efficiently for parallel requests are highly appreciated.
GAE flex is much slower to scale than GAE standard. If you need to scale quickly, see if GAE standard can work for you.
For how long did you run the parallel test? I expect that if you run it for a sufficient length of time (5-15 minutes), then your app will scale up and the response time will be back below 1 second.

Azure autoscale scale in kills in use instances

I'm using Azure Autoscale feature to process hundreds of files. The system scales up correctly to 8 instances and each instance processes one file at a time.
The problem is with scaling in. Because the scale in rules seem to be based on ALL instances, if I tell it to reduce the instance count back to 1 after an average CPU load of < 25% it will arbitrarily kill instances that are still processing data.
Is there a way to prevent it from shutting down individual instances that are still in use?
Scale down will remove the highest instance numbers first. For example, if you have WorkerRole_IN_0, WorkerRole_IN_1, ..., WorkerRole_IN_8, and you then scale down by 1, Azure will remove WorkerRole_IN_8 first. Azure has no idea what your code is doing (i.e. whether it is still processing a file) or whether it is finished and ready to shut down.
You have a few options:
If the file processing is quick, you can delay the shutdown for up to 5 minutes in the OnStop event, giving your instance enough time to finish processing the file. This is the easiest solution to implement, but not the most reliable.
If processing the file can be broken up into shorter chunks of work, then you can have the instances process chunks until the file is complete. This way it doesn't really matter if an arbitrary instance is shut down, since you don't lose any significant amount of work and another instance will pick up where it left off. See https://learn.microsoft.com/en-us/azure/architecture/patterns/pipes-and-filters for a pattern. This is the ideal solution, as it is an optimized architecture for distributed workloads, but some workloads (e.g. image/video processing) may not be easy to break up.
You can implement your own autoscale algorithm and manually shut down individual instances that you choose. To do this you would call the Delete Role Instance API (https://msdn.microsoft.com/en-us/library/azure/dn469418.aspx). This requires some external process to be monitoring your workload and executing management operations, so it may not be a good solution depending on your infrastructure.

Is reproducible benchmarking possible?

I need to test some node frameworks, or at least their routing part. That means from the moment a request arrives at the node process until a route has been decided and a function/class with the business logic is called, i.e. just before calling it. I have looked long and hard for a suitable approach, but concluded that it must be done directly in the code and not with an external benchmark tool, because I fear measuring the wrong attributes. I tried artillery and ab, but they measure a lot more attributes than I want to measure, like RTT, bad OS scheduling, random tasks executing in the OS and so on. My initial benchmarks for my custom routing code using process.hrtime() show approx. 0.220 ms (220 microseconds) execution time, but the external measurement shows 0.700 ms (700 microseconds), which is not an acceptable difference since the external figure is roughly 3.2x the in-process one. Sometimes the execution time jumps to 1.x seconds due to GC or system tasks. Now I wonder what a reproducible approach would look like. Maybe like this:
Use Docker with Scientific Linux to get a somewhat controlled environment.
A minimal docker container install, node enabled container only, no extras.
Store time results in global scope until test is done and then save to disk.
Terminate all applications with high/moderate diskIO and/or CPU on host OS.
Measure time as explained before and cross my fingers.
Any other recommendations to take into consideration?
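For the in-process measurement itself, here is a minimal sketch of the pattern described above (warm-up iterations, results kept in memory, written to disk only at the end). The iteration counts, output file name and the handleRequest placeholder are illustrative assumptions:

const fs = require('fs');

// Placeholder for the routing code under test (route lookup + dispatch).
function handleRequest(path) { /* ... */ }

const WARMUP = 10000;   // let the JIT warm up before measuring
const RUNS = 100000;
const samples = new Array(RUNS);

for (let i = 0; i < WARMUP; i++) handleRequest('/users/42');

for (let i = 0; i < RUNS; i++) {
  const start = process.hrtime.bigint();
  handleRequest('/users/42');
  samples[i] = Number(process.hrtime.bigint() - start); // nanoseconds
}

// Persist raw samples only after the run, so disk I/O never skews the timings.
fs.writeFileSync('routing-times.json', JSON.stringify(samples));

samples.sort((a, b) => a - b);
console.log('median (us):', samples[Math.floor(RUNS / 2)] / 1e3);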
