I have been experimenting with Google App Engine.
I've added execution-time measurements to my Python 3 web service. They measure the real (wall-clock) time that passes during code execution, not CPU time, using time.time().
One of the measurements covers the execution of a whole Python function: the measurement starts on the first line of the function and ends right before the result is returned.
For a simple test input, the timing is as expected: all function operations take around 0.7 seconds, as can be seen in the logs.
The reported times are similar regardless of whether the data is requested sequentially by one thread or in parallel by 16 threads. I am using JMeter to simulate the load.
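A minimal sketch of this kind of measurement (function names are placeholders, not the actual service code):

import logging
import time

logging.basicConfig(level=logging.INFO)

def do_the_work(payload):
    # placeholder for the real processing
    return payload

def handle_request(payload):
    start = time.time()  # wall-clock start on the first line of the function
    result = do_the_work(payload)
    elapsed = time.time() - start
    logging.info("handle_request took %.3f s", elapsed)  # this is the value visible in the logs
    return result  # measurement ends right before the result is returned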
The more interesting part is the overall request time.
When the service is queried sequentially by one thread, the response time is similar to the code execution time:
But for some reason, when the service is queried in parallel by 16 threads, the overall response time grows to 11 seconds:
I am surprised by this behavior.
I checked the resources used by the service. At the peak, CPU utilization was around 40% and RAM usage was under 600 MB.
Here is the app.yaml configuration for this service:
runtime: python
env: flex
service: my_service_name
entrypoint: gunicorn -b :$PORT main:app --timeout 240 --limit-request-line 0
runtime_config:
  python_version: 3
automatic_scaling:
  min_num_instances: 1
  max_num_instances: 10
  cpu_utilization:
    target_utilization: 0.8
resources:
  cpu: 1
  memory_gb: 2
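For reference, I am not passing an explicit worker count, so gunicorn should be running its default single synchronous worker per instance. An entrypoint with explicit worker/thread counts (the values here are only an illustration, not something I have tested) would look like:

entrypoint: gunicorn -b :$PORT main:app --workers 2 --threads 4 --timeout 240 --limit-request-line 0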
When I start the web service locally on my laptop, I get the same average response time regardless of the number of concurrent threads.
Any tips or hints on how to configure this to work efficiently for parallel requests would be highly appreciated.
GAE flex is much slower to scale than GAE standard. If you need to scale quickly, see if GAE standard can work for you.
For how long did you run the parallel test? I expect that if you run it for a sufficient length of time (5-15 minutes), then your app will scale up and the response time will be back below 1 second.
I am currently performing an API Load Test on my NodeJS API using JMeter and am completely new to the field. The API is deployed on an IBM Virtual Server with 4 vCPUs and 8GB of RAM.
One of my load tests stresses the API with a configuration of 2500 threads (users), a ramp-up period of 2700 seconds (45 min), and an infinite loop. The goal is not to reach 2500 threads but rather to see at what point my API throws its first error.
I am only testing one endpoint of my API, which performs a bubble sort to simulate a CPU-intensive task. Using Matplotlib, I plotted the response time in ms against the number of active threads.
I am unsure why the response time curve becomes sine-like once it crosses roughly 1100 threads. I expected the response time curve to keep rising in the same manner it does at the beginning (0-1100 threads). Is there an explanation for the sine-like behaviour of the curve towards the end?
Thank you!
Graph (red: errors, blue: response time)
There could be 2 possible reasons for this:
Your application cannot handle such a big load and performs frequent garbage collection to free up resources, or tasks are queuing up because the application cannot process them as fast as they come in. You can try using e.g. the JMeter PerfMon Plugin to ensure that the system under test doesn't lack CPU or RAM.
JMeter by default comes with a relatively low JVM heap size and very little GC tuning (as described in the "Concurrent, High Throughput Performance Testing with JMeter" article, where the author has very similar symptoms), so it might be the case that JMeter cannot send requests fast enough. Make sure to follow JMeter Best Practices and consider going for distributed testing if needed.
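As a concrete illustration (defaults differ between JMeter versions): the heap is set via the HEAP variable in the jmeter/jmeter.bat startup script, so you can either edit it there or, in recent versions, override it from the environment before launching, e.g. HEAP="-Xms1g -Xmx4g" ./jmeter -n -t testplan.jmx (testplan.jmx being a placeholder; size the heap to the RAM available on the load generator).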
I have a Python project with a server that distributes work to one or more clients. Each client is given a number of assignments which contain parameters for querying a target API. This includes a maximum number of requests per second they can make with a given API key. The clients process the response and send the results back to the server to store into a database.
Both the server and clients use Tornado for asynchronous networking. My initial implementation of the clients relied on a PeriodicCallback to ensure that n calls to the API would occur each second. I thought this was working properly, as my tests would only last 1-2 minutes.
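A minimal sketch of that PeriodicCallback pattern (placeholder URL and rate; not the actual client code):

import tornado.httpclient
import tornado.ioloop

TARGET_URL = "https://api.example.com/query"  # placeholder, not the real API
REQUESTS_PER_SECOND = 20                      # illustrative; the real limit comes with the API key

async def fetch_once():
    client = tornado.httpclient.AsyncHTTPClient()
    try:
        response = await client.fetch(TARGET_URL)
        # hand `response` off to processing / the server-communication side here
    except Exception:
        pass  # the real client logs and counts failures

def fire_batch():
    # runs once per second; fire-and-forget one coroutine per request
    loop = tornado.ioloop.IOLoop.current()
    for _ in range(REQUESTS_PER_SECOND):
        loop.spawn_callback(fetch_once)

if __name__ == "__main__":
    tornado.ioloop.PeriodicCallback(fire_batch, 1000).start()  # interval in milliseconds
    tornado.ioloop.IOLoop.current().start()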
I added some telemetry to collect statistics on performance and noticed that the clients were actually running into issues after almost exactly 2 minutes of runtime. I had set the API request rate to 20 per second (the maximum allowed by the API itself), which the clients could reliably hit. However, after 2 minutes performance would fluctuate between 12 and 18 requests per second. The number of active tasks steadily increased until it hit the maximum number of active assignments (100) given by the server, and the HTTP request time to the API, as reported by Tornado, went from 0.2-0.5 seconds to 6-10 seconds. Performance is steady if I only do 14 requests per second; anything higher than 15 requests per second runs into issues 2-3 minutes after starting. Logs can be seen here. Notice how the "Active Queries" column is steady until 01:19:26. I've truncated the log to demonstrate this.
I believed the issue was the use of a single process on the client to handle both communication with the server and with the API. I proceeded to split the primary process into several different processes: one handles all communication with the server, one (or more) handles queries to the API, another processes API responses into a flattened class, and finally a multiprocessing Manager provides the queues. The performance issues were still present.
I thought that, perhaps, Tornado was the bottleneck and decided to refactor. I chose aiohttp and uvloop. I split the primary process in a similar manner to that in the previous attempt. Unfortunately, performance issues are unchanged.
I took both refactors and enabled them to split work into several querying processes. However, no matter how much you split the work, you still encounter problems after 2-3 minutes.
I am using both Python 3.7 and 3.8 on MacOS and Linux.
At this point, it does not appear to be a limitation of a single package. I've thought about the following:
Python's asyncio library cannot handle more than 15 coroutines/tasks being generated per second
I doubt that this is true given that different libraries claim to be able to handle several thousand messages per second simultaneously. Also, we can hit 20 requests per second just fine at the start with very consistent results.
The API is unable to handle more than 15 requests from a single client IP
This is unlikely as I am not the only user of the API and I can request 20 times per second fairly consistently over an extended period of time if I over-subscribe processes to query from the API.
There is a system configuration causing the limitation
I've tried both MacOS and Debian, which yield the same results. It's possible that it's a *nix problem.
Variations in responses cause a backlog which grows linearly until it cannot be tackled fast enough
Response times from the API sometimes grow and shrink between 0.2 and 1.2 seconds, but the number of active tasks returned by asyncio.all_tasks remains consistent in the telemetry data. If this were true, we wouldn't consistently encounter the issue at the same point in time every run.
We're overtaxing the hardware with the number of tasks generated per second and causing thermal throttling
Although CPU temperatures spike, neither MacOS nor Linux report any thermal throttling in the logs. We are not hitting more than 80% CPU utilization on a single core.
At this point, I'm not sure what's causing it and have considered refactoring the clients into a different language (perhaps C++ with Boost libraries). Before I dive into something so foolish, I wanted to ask if I'm missing something simple.
Conclusion
Performance appears to vary wildly depending on time of day. It's likely to be the API.
How this conclusion was made
I created a new project to demonstrate the capabilities of asyncio and determine whether it is the bottleneck. The project takes two websites, one acting as the baseline and the other being the target API, and runs through different methods of testing:
Spawn one process per core, pass a semaphore, and query up to n-times per second
Create a single event loop and create n tasks per second (a sketch of this variant appears after this list)
Create multiple processes with an event loop each to distribute the work, with each loop performing (n-number / processes) tasks per second
(Note that spawning processes is incredibly slow and is often commented out unless using high-end desktop processors with 12 or more cores)
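A minimal sketch of the single-loop method (placeholder URL; rate and duration are illustrative):

import asyncio
import time

import aiohttp

TARGET_URL = "https://example.com/"  # placeholder, not the actual target API
TASKS_PER_SECOND = 20                # illustrative rate
DURATION_SECONDS = 300               # long enough to pass the 2-3 minute mark

async def fetch(session, durations):
    start = time.monotonic()
    try:
        async with session.get(TARGET_URL) as resp:
            await resp.read()  # just drain the body, no CPU-bound processing
        durations.append(time.monotonic() - start)
    except Exception:
        pass  # the real harness counts errors separately

async def main():
    durations = []
    tasks = []
    async with aiohttp.ClientSession() as session:
        for _ in range(DURATION_SECONDS):
            # spawn one batch of tasks, then wait for the next second
            tasks += [asyncio.create_task(fetch(session, durations))
                      for _ in range(TASKS_PER_SECOND)]
            await asyncio.sleep(1)
        await asyncio.gather(*tasks)
    if durations:
        print(f"completed {len(durations)} requests, "
              f"avg {sum(durations) / len(durations):.3f} s, max {max(durations):.3f} s")

if __name__ == "__main__":
    asyncio.run(main())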
The baseline website would be queried up to 50 times per second. asyncio could complete 30 tasks per second reliably for an extended period, with each task completing its run in 0.01 to 0.02 seconds. Responses were very consistent.
The target website would be queried up to 20 times per second. Sometimes asyncio would struggle despite circumstances being identical (JSON handling, dumping response data to queue, returning immediately, no CPU-bound processing). However, results varied between tests and could not always be reproduced. Responses would be under 0.4 seconds initially but quickly increase to 4-10 seconds per request. 10-20 requests would return as complete per second.
As an alternative method, I chose a parent URI of the target website. This URI doesn't involve a large query to their database but is instead served back as a static JSON response. Responses bounced between 0.06 seconds and 2.5-4.5 seconds. However, 30-40 responses would be completed per second.
Splitting requests across processes with their own event loop would decrease response time in the upper-bound range by almost half, but still took more than one second each to complete.
The inability to reproduce consistent results every time from the target website would indicate that it's a performance issue on their end.
I am transcoding lots of MP4 files to FLAC/WAV using fluent-ffmpeg on GAE flex.
I am thinking about how GAE will handle 10-20 concurrent transcoding operations.
Transcoding is computationally expensive, and GAE might spawn new instances for each request if not optimized.
app.yaml
runtime: nodejs
env: flex
resources:
  cpu: 2
  memory_gb: 3.75
automatic_scaling:
  min_num_instances: 1
  max_num_instances: 30
  cpu_utilization:
    target_utilization: 0.8
Is there a way to keep the load on a single GAE instance? Node web workers?
What's the conventional resources section in app.yaml for such tasks?
If you want to avoid the creation of new instances (horizontal scaling) for each request, you should work on your vertical scaling: choose an instance configuration powerful enough to process your requests and then set the max_num_instances parameter according to your needs. Of course, structuring and designing your code in an "economical" way is the most important part.
If you want to be sure that only one GAE instance is running, switch to manual_scaling and set instances: 1, or leave it on automatic_scaling and set max_num_instances: 1 (I did not try the latter, but I do not see why it wouldn't work).
Also keep in mind to set the max_concurrent_requests parameter to whatever value you need. By default it is set to 10, so you would not be able to process more than 10 concurrent requests at a time.
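A minimal app.yaml sketch of the single-instance variant described above (values are illustrative and untested; adjust the resources to your workload):

runtime: nodejs
env: flex
resources:
  cpu: 2
  memory_gb: 3.75
manual_scaling:
  instances: 1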
I'm running load tests on AWS Lambda with Charles Proxy, but I am confused by the timelines chart it produces. I've set up a test with 100 concurrent connections and expect varying degrees of latency, but I also expect all 100 requests to be kicked off at the same time (hence the concurrency setting in Charles Proxy's "Repeat Advanced" feature). However, I'm seeing some requests start a bit late ... that is, if I understand the chart correctly.
With only 100 invocations, I should be well within the concurrency limit set by AWS Lambda, so why are these requests being kicked off late (see requests 55-62 on the attached image)?
Lambda can take from a few hundred milliseconds to 1-2 seconds to start up when it's in a "cold state". Cold means it needs to download your package, unpack it, load it into memory, and then start executing your code. After execution, the container is kept "alive" for about 5 to 30 minutes ("warm state"). If you make a request again while it's warm, container startup is much faster.
You probably had a few containers already warm when you started your test. Those started up faster. Since the other requests came concurrently, Lambda needed to start more containers and those came from a "cold state", thus the time difference you see in the chart.
I'm using JMeter 3.1.1 to run a load test. My test plan has 40 threads, and each thread executes 6 HTTP requests. It runs fine for the first few hours, with a latency of around 20 ms.
After a few hours, latency grows to around 500 ms. I verified that the server is processing requests fine. Also, I have no Listeners in my test plan, and I run it in non-GUI mode.
It also seems that the thread group is executing only one thread at a time, because I see hardly one or two requests being executed by the thread group per second.
I'm really clueless about what to suspect. Any help would be greatly appreciated.
By the way, memory and CPU consumption are normal.
About my test plan:
Total thread groups: 4
1. Setup Thread Group
2. Load test thread group with 40 threads
(Action To be taken after error :Continue
Ramp-Up period: 0
Number of Threads: 40
Loop Count: Forever)
2.1 Counter
2.2 Random Variable
2.3 User Defined Variables
2.4 If Condition = true
- 2.4.1 HTTP Request1
- 2.4.2 HTTP Request2
- 2.4.3 Loop for 5 times
-- 2.4.3.1 HTTP Request1
-- 2.4.3.2 HTTP Request2
3. Introspection thread group with 1 thread
4. Tear Down thread group
Please let me know if more details are needed.
Another observation is:
The server has 4418 connections in TIME_WAIT (I checked the 'Use KeepAlive' option for the HTTP Requests, but there are still that many TIME_WAITs).
Latest observations (thanks to one and all for your valuable comments):
Actually, memory must be an issue. I had already configured the JVM like this:
-Xms512m -Xmx2048m -XX:NewSize=512m -XX:MaxNewSize=2048m
But I really wonder why the JVM was not going beyond 512 MB, so I tried setting both Xms and Xmx to 2g each. Now it runs for longer, but its performance is still slowing down. Maybe my BeanShell post-processors are consuming all the memory; I really wonder why they are not releasing it. You can see below how performance degrades per hour.
Hour #Requests sent
---- --------------
Hour 1: 1471917
Hour 2: 1084182 (Seems all 2g heap is used up by this time)
Hour 3: 705471
Hour 4: 442912
Hour 5: 255826
Hour 6: 136292
I read that BeanShell hogs memory, but I have no choice but to use it, as I have to use a third-party jar within the sampler to make a few Java calls. I'm not sure if I can do the same using JSR223 (Groovy) or any other better-performing sampler (pre/post-processor).
As you have seen the heap settings I used, here are the memory and CPU utilization figures for JMeter. I'm running 100 threads now. What should I do in my test plan to reduce the CPU utilization? I have a 100 ms sleep after every 4 HTTP requests.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
11850 xxx 20 0 7776m 2.1g 4744 S 678.2 27.4 6421:50 java
%CPU: 678.2 (Fluctuating between 99% - 700%)
MEM: 2.1g (Xmx = 2g)
1) Are you running JMeter with the standard startup script (jmeter/jmeter.bat)?
If so, note that the default JVM heap size there is capped at 512 MB. Consider increasing it, at least at the maximum end (i.e., change the default -Xmx512m).
The next thing to consider is the default -XX:NewSize=128m -XX:MaxNewSize=128m values.
Here's what Oracle suggests:
In heavy throughput environments, you should consider using this option to increase the size of the JVM young generation. By default, the young generation is quite small, and high throughput scenarios can result in a large amount of generated garbage. This garbage collection, in turn, causes the JVM to inadvertently promote short-lived objects into the old generation.
So try to play with these parameters; that may help.
P.S. Are you, by chance, running it on an AWS EC2 instance? If so, what's the instance type?
Thanks to all who tried to help me.
However, I was able to resolve this.
The culprit is the If Controller, which evaluates its condition for every iteration of every thread. Sounds quite normal, doesn't it? The problem is that the condition evaluation is JavaScript-based, so all threads were eating CPU and memory invoking JavaScript.
Now I'm getting consistent requests to the server, and JMeter is also almost stable at 1.9 GB of memory for 100 threads.
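For anyone hitting the same issue: the usual recommendation is to tick "Interpret Condition as Variable Expression?" on the If Controller and use a function such as ${__jexl3("${myVar}" == "true")} (myVar being a placeholder for whatever variable you branch on) instead of a JavaScript condition, so no JavaScript engine is invoked on every iteration.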
I'm posting this just in case anyone can benefit from it without wasting days and nights figuring out the issue :)